
Crawling Novels with Python

2019-11-10 22:46:48

1. First, pick a novel site to crawl

2. Analyze its search URL

I chose Dingdian Novels (顶点小说网, http://www.23us.so/). The URL this site uses for a search looks like this:

'http://zhannei.baidu.com/cse/search?s={0}&entry=1&q={1}&isNeedCheckDomain=1&jump=1'

The {0} and {1} in this URL vary by user and by search keyword. After running a few searches on the site, the pattern becomes clear: {0} is a fixed value for a given user and does not change between searches, while {1} is the search keyword itself. With that, it is easy to construct the URL for any novel.
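As an illustration, here is a minimal Python 2 sketch of building such a URL. The search_id value is only a placeholder (substitute the one you observe for yourself), and the keyword is URL-encoded with urllib.quote here for safety, whereas the full code further down passes it in directly:

# -*- coding: utf-8 -*-
# Minimal sketch: building the search URL for a keyword.
import urllib

search_url_template = ('http://zhannei.baidu.com/cse/search'
                       '?s={0}&entry=1&q={1}&isNeedCheckDomain=1&jump=1')

search_id = 17233375349940438896   # placeholder: fixed per user, does not change between searches
keyword = '神魔乐园'                 # the novel title to search for (utf-8 bytes)

# {0} takes the fixed search id, {1} takes the (URL-encoded) keyword
search_url = search_url_template.format(search_id, urllib.quote(keyword))
print search_url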

3. Use the analyzed URL to fetch the search results

By parsing the search result page we can pull out whatever we want: the link to the chapter catalog, the latest chapter, and so on. Here I only extract the catalog link, and only from the first search result.
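A minimal sketch of this step, assuming (as in the full code below) that each result title is an <a cpos="title"> element; the URL here is just the example built in step 2:

# -*- coding: utf-8 -*-
# Minimal sketch: fetch the search result page and take the first hit's catalog link.
import urllib2
import lxml.html

search_url = ('http://zhannei.baidu.com/cse/search?s=17233375349940438896&entry=1'
              '&q=%E7%A5%9E%E9%AD%94%E4%B9%90%E5%9B%AD&isNeedCheckDomain=1&jump=1')
search_html = urllib2.urlopen(search_url).read()
search_tree = lxml.html.fromstring(search_html)

results = search_tree.cssselect('a[cpos="title"]')  # every result title link
catalog_url = results[0].get('href')                # keep only the first result
print catalog_url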

4. Get the novel's chapter catalog

Using the catalog link from step 3 we reach the catalog page. At this point we are essentially done: parse this page's source to extract each chapter's link, and the hard part is over.
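A minimal sketch of this step, assuming (as in the full code below) that each chapter link sits inside a td.L cell; the catalog_url here is only a placeholder for the link found in step 3:

# -*- coding: utf-8 -*-
# Minimal sketch: fetch the catalog page and build a mapping
# from chapter title to chapter URL.
import urllib2
import lxml.html

catalog_url = 'http://www.23us.so/...'  # placeholder: use the link found in step 3
catalog_html = urllib2.urlopen(catalog_url).read()
catalog_tree = lxml.html.fromstring(catalog_html)

catalog = {}
for elem in catalog_tree.cssselect('td.L a'):
    catalog[elem.text] = elem.get('href')  # chapter title -> chapter page URL
print len(catalog)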

5. Get the novel's content

Repeat the same routine: parse the page source and extract the chapter text.
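A minimal sketch of this step with BeautifulSoup, assuming (as in the full code below) that the chapter body lives in a dd element with id="contents"; chapter_url stands in for one of the links collected in step 4:

# -*- coding: utf-8 -*-
# Minimal sketch: download a single chapter page and pull out its text.
import urllib2
from bs4 import BeautifulSoup

chapter_url = 'http://www.23us.so/...'  # placeholder: use a chapter link from step 4
chapter_html = urllib2.urlopen(chapter_url).read()

soup = BeautifulSoup(chapter_html, 'html.parser')
article = soup.find('dd', attrs={'id': 'contents'})  # the chapter body container
print article.text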

6. Code

# -*- coding: utf-8 -*-
import sys
import urllib
import urllib2
import lxml.html
import time
import cookielib
from bs4 import BeautifulSoup

# Avoid garbled Chinese text (Python 2)
reload(sys)
sys.setdefaultencoding('utf8')


class dingdian_crawler(object):
    search_data = {}
    search_url_template = 'http://zhannei.baidu.com/cse/search?s={0}&entry=1&q={1}&isNeedCheckDomain=1&jump=1'
    catalog = {}        # chapter title -> chapter URL
    catalog_key = {}    # chapter index -> chapter title
    lasted_article = list()

    def __init__(self, search_content=None, search_id=17233375349940438896):
        self.search_data['content'] = search_content
        self.search_data['id'] = search_id
        self.search_url = self.search_url_template.format(
            self.search_data['id'], self.search_data['content']).decode('utf-8')
        # cj = cookielib.CookieJar()
        # self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        # self.opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36')]

    def load_search_data(self, search_content, search_id=17233375349940438896):
        self.__init__(search_content, search_id)

    def get_catalog(self):
        # Fetch the search result page and take the first hit's catalog link
        print self.search_url
        search_html = self.download_html(self.search_url)
        search_tree = lxml.html.fromstring(search_html)
        search_result = search_tree.cssselect('a[cpos="title"]')
        print len(search_result)
        catalog_url = search_result[0].get('href')
        print catalog_url
        # Fetch the catalog page and collect every chapter link
        catalog_html = self.download_html(catalog_url)
        catalog_tree = lxml.html.fromstring(catalog_html)
        catalog_elems = catalog_tree.cssselect('td.L a')
        num = len(catalog_elems)
        catalog_file = open('catalog.txt', 'w')  # opened but not used in this version
        i = 0
        for elem in catalog_elems:
            self.catalog[elem.text.decode('utf-8')] = elem.get('href')
            self.catalog_key[i] = elem.text.decode('utf-8')
            i += 1
        # Remember the latest chapter (title, URL)
        self.lasted_article = list()
        self.lasted_article.append(self.catalog_key[num - 1])
        self.lasted_article.append(self.catalog[self.catalog_key[num - 1]])
        print len(self.catalog)

    def get_novel_content(self, start, num=0):
        # Download chapter `start`, or chapters start..start+num-1 when num > 0
        if start + int(num) > len(self.catalog_key):
            print 'maybe out of article list range'
            print 'please reset start or num'
            return
        if num == 0:
            url = self.get_article_url(start)
            article_html = self.download_html(url)
            soup = BeautifulSoup(article_html, 'html.parser')
            article = soup.find('dd', attrs={'id': 'contents'})
            content = '\n\n\n' + self.catalog_key[start - 1] + '\n\n\n'
            content = content + article.text.decode('utf-8')
            file_name = self.search_data['content'] + str(start) + '.txt'
            self.save_article(file_name, [content, ])
        else:
            i = 0
            contents = []
            while i < num:
                content = ''
                url = self.get_article_url(start + i)
                article_html = self.download_html(url)
                soup = BeautifulSoup(article_html, 'html.parser')
                article = soup.find('dd', attrs={'id': 'contents'})
                content = '\n\n\n' + self.catalog_key[i + start - 1] + '\n\n\n'
                content = content + article.text.decode('utf-8')
                contents.append(content)
                i = i + 1
            file_name = self.search_data['content'] + str(start) + '-' + str(start + num - 1) + '.txt'
            self.save_article(file_name, contents)

    def get_article_url(self, index):
        # Chapter numbering starts at 1
        try:
            print self.catalog[self.catalog_key[index - 1]]
            return self.catalog[self.catalog_key[index - 1]]
        except Exception:
            print 'out of article list range'
            print 'please reset index'

    def print_search_url(self):
        print self.search_url

    def save_article(self, name, articles):
        wfile = open(name, 'w')
        for article in articles:
            wfile.write(article)
        wfile.close()

    def download_html(self, url, times=5):
        # Retry up to 5 times, waiting 2 seconds between attempts
        try:
            html = urllib2.urlopen(url).read()
            print 'download success'
            return html
        except Exception as e:
            if times:
                print 'download fail'
                print 'try download again:', (6 - times)
                times = times - 1
                time.sleep(2)
                return self.download_html(url, times)
            else:
                print 'crawl failed'
                sys.exit(1)


if __name__ == '__main__':
    crawler = dingdian_crawler('神魔乐园'.decode('utf-8'))
    crawler.get_catalog()
    print crawler.lasted_article[0].decode('utf-8')
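As a usage sketch (not shown in the __main__ block above), once get_catalog() has built the chapter list, chapters can be saved to disk like this; the chapter numbers are only an illustration:

crawler = dingdian_crawler('神魔乐园'.decode('utf-8'))
crawler.get_catalog()
crawler.get_novel_content(1)       # save chapter 1 alone to <keyword>1.txt
crawler.get_novel_content(1, 10)   # save chapters 1-10 into <keyword>1-10.txt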

7. A brief note on the code

search_id differs from person to person. You can find your own value simply by running a search and inspecting the result, or by capturing the request with a tool: the sid value the page posts is exactly the search_id used in the code.
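If your own value differs, it can simply be passed in when constructing the crawler; the number below is only a placeholder:

my_search_id = 12345678901234567890  # placeholder: the sid captured from your own search
crawler = dingdian_crawler('神魔乐园'.decode('utf-8'), my_search_id)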

8. Final notes

The code may not be very readable; my apologies.

One thing I do not quite understand is the "switch source" feature in the novel-reader apps on the market: do they keep a separate crawler for each site, use one unified crawler, or is it not a crawler at all? I am not sure; when I have time I would like to look into it and build a small novel-reading app of my own for fun.

