利用Python搭建的简易排序搜索引擎

2019-11-08 19:46:06

字体：大中小

来源：转载

供稿：网友

本文源代码转自搜索引擎原理，博主进行整理调BUG并进行注释，对于Python初学者来说是了解爬虫、网页排序算法非常好的素材。

首先来介绍一下PageRank网页排序算法（注：转自PageRank算法简介及Map-Reduce实现，详情点击链接）：

PageRank对网页排名的算法，曾是Google发家致富的法宝。以前虽然有实验过，但理解还是不透彻，这几天又看了一下，这里总结一下PageRank算法的基本原理。

一、什么是pagerank

PageRank的Page可是认为是网页，表示网页排名，也可以认为是Larry Page(google 产品经理)，因为他是这个算法的发明者之一，还是google CEO（^_^）。PageRank算法计算每一个网页的PageRank值，然后根据这个值的大小对网页的重要性进行排序。它的思想是模拟一个悠闲的上网者，上网者首先随机选择一个网页打开，然后在这个网页上呆了几分钟后，跳转到该网页所指向的链接，这样无所事事、漫无目的地在网页上跳来跳去，PageRank就是估计这个悠闲的上网者分布在各个网页上的概率。

二、最简单pagerank模型

互联网中的网页可以看出是一个有向图，其中网页是结点，如果网页A有链接到网页B，则存在一条有向边A->B，下面是一个简单的示例：

这个例子中只有四个网页，如果当前在A网页，那么悠闲的上网者将会各以1/3的概率跳转到B、C、D，这里的3表示A有3条出链，如果一个网页有k条出链，那么跳转任意一个出链上的概率是1/k，同理D到B、C的概率各为1/2，而B到C的概率为0。一般用转移矩阵表示上网者的跳转概率，如果用n表示网页的数目，则转移矩阵M是一个n*n的方阵；如果网页j有k个出链，那么对每一个出链指向的网页i，有M[i][j]=1/k，而其他网页的M[i][j]=0；上面示例图对应的转移矩阵如下：

初试时，假设上网者在每一个网页的概率都是相等的，即1/n，于是初试的概率分布就是一个所有值都为1/n的n维列向量V0，用V0去右乘转移矩阵M，就得到了第一步之后上网者的概率分布向量MV0,（nXn）*(nX1)依然得到一个nX1的矩阵。下面是V1的计算过程：

注意矩阵M中M[i][j]不为0表示用一个链接从j指向i，M的第一行乘以V0，表示累加所有网页到网页A的概率即得到9/24。得到了V1后，再用V1去右乘M得到V2，一直下去，最终V会收敛，即Vn=MV(n-1)，上面的图示例，不断的迭代，最终V=[3/9,2/9,2/9,2/9]’：

现在应该能够基本理解PageRank算法了，那么附上代码：

#the search engine is divided into 3 modules:web crawl,build and use of index,page rank#----------------------------web_crawl--------------------------------def get_page(url):    try:        import urllib #引入urllib库        return urllib.urlopen(url).read() #打开网站并读取    except:        return ""  #抛出异常def get_next_target(page): #获取下一个链接网站    start_link = page.find('"') #str.find(str, beg=0 end=len(string)) 如果找到此方法返回的索引，否则返回-1;参数：此选项指定要搜索的字符串/这是开始索引，默认情况下为 0/这是结束索引，默认情况下它等于字符串的长度    if start_link == -1:        return None, 0    start_quote = page.find('"', start_link)    end_quote = page.find('"', start_quote + 1)    url = page[start_quote + 1:end_quote]    return url, end_quotedef get_all_links(page): #获取网页上面所有的url    links = [] #存取url的数组    while True:        url, endpos = get_next_target(page)        if url:            links.append(url)            page = page[endpos:]        else:            break    return linksdef union(a, b): #并查集    for e in b:        if e not in a:            a.append(e)def crawl_web(seed): # returns index, graph of inlinks    tocrawl = [seed]    crawled = []    graph = {}  # , [list of pages it links to]    index = {}    while tocrawl:        page = tocrawl.pop()        if page not in crawled:            content = get_page(page)            add_page_to_index(index, page, content)            outlinks = get_all_links(content)            graph[page] = outlinks            union(tocrawl, outlinks)            crawled.append(page)    return index, graph#--------------------------------build_index-----------------------------------def add_page_to_index(index, url, content):    Words = content.split() #str.split(str="", num=string.count(str)) 返回分割后的字符串列表    for word in words:        PRint word        add_to_index(index, word, url) def add_to_index(index, keyword, url):    if keyword in index:        index[keyword].append(url) #存储url，类似map    else:        index[keyword] = [url] def lookup(index, keyword): #查找索引    if keyword in index:        return index[keyword]    else:        return None    #---------------------------------page_rank---------------------------------def compute_ranks(graph):    d = 0.8 # damping factor    numloops = 10    ranks = {}    npages = len(graph)    for page in graph:         ranks[page] = 1.0 / npages    for i in range(0, numloops):        newranks = {}        for page in graph:            newrank = (1 - d) / npages            for node in graph:                if page in graph[node]:                    newrank = newrank + d * (ranks[node] / len(graph[node]))            newranks[page] = newrank        ranks = newranks    return ranksdef quick_sort(url_lst,ranks): #对网页权值进行快排, 不稳定    url_sorted_worse=[]    url_sorted_better=[]    if len(url_lst)<=1:        return url_lst    pivot=url_lst[0]    for url in url_lst[1:]:        if ranks[url]<=ranks[pivot]:            url_sorted_worse.append(url)        else:            url_sorted_better.append(url)    return quick_sort(url_sorted_better,ranks)+[pivot]+quick_sort(url_sorted_worse,ranks)def ordered_search(index, ranks, keyword): #搜索结果排序    if keyword in index:        all_urls=index[keyword]    else:        return None    return quick_sort(all_urls,ranks)#---------------------------------test-----------------------------------index, graph = crawl_web('http://udacity.com/cs101x/urank/index.html')ranks = compute_ranks(graph)print ordered_search(index, ranks, 'Hummus')#>>> ['http://udacity.com/cs101x/urank/kathleen.html',#    'http://udacity.com/cs101x/urank/nickel.html',#    'http://udacity.com/cs101x/urank/arsenic.html',#    'http://udacity.com/cs101x/urank/hummus.html',#    'http://udacity.com/cs101x/urank/index.html']#print ordered_search(index, ranks, 'the')#>>> ['http://udacity.com/cs101x/urank/nickel.html',#    'http://udacity.com/cs101x/urank/arsenic.html',#    'http://udacity.com/cs101x/urank/hummus.html',#    'http://udacity.com/cs101x/urank/index.html']#print ordered_search(index, ranks, 'babaganoush')#>>> None
代码经调试可用，但性能有限，爬取的网页会经常出现乱码，源网页的编码方式不得而知。乱码的出现直接导致了最后输出的网页排名有缺漏，目前这个问题还没有解决，争取在之后的调试中解决这个问题。哈哈哈，祝大家新年快乐，好好学习，天天向上~<(￣ˇ￣)/<(￣ˇ￣)/

上一篇：python使用win32*模块模拟人工操作——城通网盘下载器（一）

下一篇：python安装requests库