
A beginner's crawler: scraping movie information from the Douban Top 250

2019-11-06 08:33:12
import requests
import lxml.html
from bs4 import BeautifulSoup
import re
import bs4
from pymongo import MongoClient


def req(url, param):
    # fetch one page of the Top 250 list and return its HTML text
    resp = requests.get(url, params=param).text
    return resp


def get_data(data):
    # locate the <ol> block that holds the list of films
    source_soup = BeautifulSoup(data, 'html.parser')
    data_ol = source_soup.ol
    films = []
    for tag_li in data_ol:
        if isinstance(tag_li, bs4.element.Tag):
            # re-parse this <li> with lxml so the fields can be pulled out via XPath
            datas = lxml.html.fromstring(str(tag_li))
            # film titles: the Chinese title plus the alternative titles
            names = []
            name1 = datas.xpath('//span[@class="title"]/text()')
            name2 = datas.xpath('//span[@class="other"]/text()')
            names.append(name1)
            names.append(name2)
            # director and main cast
            info = datas.xpath('//p[@class=""]/text()')
            # rating, and the vote count that sits in a plain <span> of this <li>
            star = datas.xpath('//span[@class="rating_num"]/text()')
            num = re.search('<span>(.*?)</span>', str(tag_li)).group(1)
            # the film's one-line quote
            quote = datas.xpath('//span[@class="inq"]/text()')
            # collect everything for this film into one dict
            film_info = {
                'name': names,
                'info': info,
                'star': star,
                'num': num,
                'quote': quote
            }
            films.append(film_info)
    return films


cli = MongoClient('localhost', 27017)
db = cli.films
for i in range(1, 11):
    # each page of the list shows 25 films, so step the start offset by 25
    param = {
        'start': (i - 1) * 25,
        'filter': ""
    }
    url = 'https://movie.douban.com/top250'
    # insert() is gone from newer pymongo; insert_many stores one page per call
    db.films2.insert_many(get_data(req(url, param)))
print("spider success")
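If the pages come back empty or blocked, it is often because Douban rejects the default requests User-Agent. A minimal sketch of a req variant that sends a browser-like header is below; the header string is only an example and nothing the site documents as required.

import requests

# Sketch only: assumes Douban may block the library's default User-Agent,
# which is a common observation but not guaranteed; the value is a placeholder.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def req(url, param):
    resp = requests.get(url, params=param, headers=HEADERS, timeout=10)
    resp.raise_for_status()  # stop early instead of parsing an error page
    return resp.text

Calling raise_for_status() makes a blocked request fail loudly instead of letting get_data() choke on an error page.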

The script uses bs4, lxml.html's XPath, and requests. Comments and suggestions from readers are very welcome.
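For comparison, the same fields can also be pulled with BeautifulSoup alone, without re-parsing each <li> through lxml. The sketch below assumes the list still sits in an <ol class="grid_view"> and that the class names match what the Top 250 page showed at the time of writing; get_data_bs4_only is just an illustrative name.

from bs4 import BeautifulSoup

def get_data_bs4_only(html):
    # assumed page layout: <ol class="grid_view"> with one <li> per film
    soup = BeautifulSoup(html, 'html.parser')
    films = []
    for li in soup.select('ol.grid_view > li'):
        star_spans = li.select('div.star > span')
        films.append({
            # Chinese title plus the "/ alternate titles" span
            'name': [s.get_text(strip=True) for s in li.select('span.title, span.other')],
            # director / cast block
            'info': li.select_one('div.bd > p').get_text(' ', strip=True),
            # numeric rating
            'star': li.select_one('span.rating_num').get_text(strip=True),
            # the last span in the star block holds the "xxx人评价" vote count
            'num': star_spans[-1].get_text(strip=True) if star_spans else '',
            # one-line quote; a few entries have none, hence the list
            'quote': [s.get_text(strip=True) for s in li.select('span.inq')],
        })
    return films

Either approach stores the same information; mixing bs4 with lxml's XPath in the main script simply reflects the libraries the author wanted to practise.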

