A Brief Summary of Anti-Crawler Techniques

2020-02-16 11:03:49


Source: reposted

Contributed by: a site user

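Crawlers and anti-crawler measures are locked in a love-hate struggle with a history long enough to fill a book. In the big-data era data is money, so many companies deploy anti-crawler mechanisms on their sites to keep the data on their pages from being scraped. There is a trade-off, though: a mechanism that is too strict can block requests from genuine users, while fighting crawlers hard and also keeping the false-positive rate low drives up development cost.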

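Simple, low-end crawlers are fast and poorly disguised. Without any anti-crawler mechanism they can scrape large amounts of data very quickly, and the sheer number of requests can even keep the server from working normally.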
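To make the trade-off concrete, here is a minimal sketch of one of the simplest server-side defenses: a per-client sliding-window rate limit. The class name, the parameters, and the idea of keying on a client id (an IP address, for example) are illustrative assumptions, not something taken from a particular site.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`.

    Hypothetical helper for illustration only.
    """

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id):
        """Return True if this request is within the limit, False otherwise."""
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the site could redirect or reject here
        hits.append(now)
        return True
```

Choosing `max_requests` is where the tension shows: a low value stops fast crawlers sooner but also catches real users who click quickly or share an IP address.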

1. 302 redirects while crawling

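When you crawl a site too fast or send too many requests, the site sends your client a link to an image captcha that you have to solve. I ran into this while crawling Lianjia and Lagou.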

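The 302 redirect happens because crawling too fast produces abnormal traffic. The server recognizes the requests as machine-generated and redirects them to a specific URL, usually a captcha image or an empty link.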
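A crawler can notice this situation by looking at the status code and the Location header of the response. The sketch below is one possible way to classify it. The function name and the hint substrings are assumptions made for illustration, since real sites use their own URLs for verification pages.

```python
# Hypothetical substrings that suggest a captcha or verification page.
VERIFICATION_HINTS = ("captcha", "verify")


def classify_response(status_code, location):
    """Classify a response from its status code and Location header.

    Returns "not-redirected", "empty-redirect", "verification"
    or "other-redirect".
    """
    if status_code != 302:
        return "not-redirected"
    if not location:
        return "empty-redirect"  # a 302 with no usable target
    if any(hint in location.lower() for hint in VERIFICATION_HINTS):
        return "verification"  # target looks like a captcha page
    return "other-redirect"
```

With requests, redirects are followed by default, so the 302 only appears directly if the call is made with `allow_redirects=False`. The function can then be called as `classify_response(r.status_code, r.headers.get("Location"))`.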

At that point the client has already been identified, so switch to a proxy IP and continue crawling.
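Switching proxies can be sketched as follows. This is a rough illustration under stated assumptions, not the author's code: `fetch_with_proxy_rotation` is a hypothetical helper, `session` is assumed to be any object with a requests-style `get()` method (such as `requests.Session()`), and the proxy list is whatever non-empty list of proxy URLs you have available.

```python
import itertools


def fetch_with_proxy_rotation(session, url, proxy_urls, max_attempts=3):
    """Request `url` through successive proxies until one is not redirected.

    `proxy_urls` must be a non-empty list of proxy URLs.
    Returns the first response whose status is not 302, or None if
    every attempt was redirected.
    """
    proxy_cycle = itertools.cycle(proxy_urls)
    for _ in range(max_attempts):
        proxy = next(proxy_cycle)
        response = session.get(
            url,
            proxies={"http": proxy, "https": proxy},
            allow_redirects=False,  # surface the 302 instead of following it
            timeout=10,
        )
        if response.status_code != 302:
            return response
        # Still redirected: this proxy is flagged too, so try the next one.
    return None
```

A call might look like `fetch_with_proxy_rotation(requests.Session(), url, ["http://10.10.1.10:3128"])`, where the proxy address is only a placeholder.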

2. Request headers

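Some sites dislike crawlers and reject every request that looks like one. In that case we need to disguise the crawler as a browser, which we do by modifying the headers of the HTTP request: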

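import requests

# Example URL (an assumption: the original snippet does not define `url`),
# chosen to match the Host header below
url = "https://bj.lianjia.com/"

# Headers copied from a browser so the request looks like ordinary browser traffic
headers = {
    'Host': "bj.lianjia.com",
    'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    'Accept-Encoding': "gzip, deflate, sdch",
    'Accept-Language': "zh-CN,zh;q=0.8",
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.87 Safari/537.36",
    'Connection': "keep-alive",
}

p = requests.get(url, headers=headers)
print(p.content.decode('utf-8'))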

3. Simulated login

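Logging in usually involves a captcha. Here we use selenium to fill in and submit the login form ourselves: the script saves an image of the page that contains the captcha, we open the image and read the captcha, type it at the console prompt, and the script submits the form to complete the login.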

# Note: this targets the Selenium 3 API. PhantomJS support and the
# find_element_by_* helpers were removed in Selenium 4.
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait  # waits until a condition holds before continuing

# Build the web driver
driver = webdriver.PhantomJS(executable_path='C:/PyCharm 2016.2.3/phantomjs/phantomjs.exe')
driver.get('https://www.zhihu.com/#signin')  # open the page

driver.find_element_by_xpath('//input[@name="password"]').send_keys('your_password')
driver.find_element_by_xpath('//input[@name="account"]').send_keys('your_account')

driver.get_screenshot_as_file('zhihu.jpg')  # screenshot of the current page
input_solution = input('Enter the captcha: ')
driver.find_element_by_xpath('//input[@name="captcha"]').send_keys(input_solution)
time.sleep(2)

# Submit the form. You can either select the login button and call click(),
# or select the form and call submit().
driver.find_element_by_xpath('//form[@class="zu-side-login-box"]').submit()
# driver.find_element_by_xpath('//button[@class="sign-button submit"]').click()
search_window = driver.current_window_handle  # handle of the current window

try:
    # Wait up to 5 seconds for the home link; finding it is treated as a successful login
    WebDriverWait(driver, 5).until(
        lambda the_driver: the_driver.find_element_by_xpath('//*[@id="zh-top-link-home"]'))
    print('Login succeeded')
except TimeoutException:
    print('Login failed')
    driver.save_screenshot('screen_shoot.jpg')  # screenshot of the current page
finally:
    driver.quit()  # shut down the driver
