
Avoiding getting banned (Scrapy)

2019-11-08 02:49:03
Source: reposted
Contributed by: community member

Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.

Here are some tips to keep in mind when dealing with these kinds of sites:

- Rotate your user agent from a pool of well-known browser user agents (search online to get a list of them).
- Disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour.
- Use download delays (2 or higher). See the DOWNLOAD_DELAY setting.
- If possible, use Google cache to fetch pages instead of hitting the sites directly.
- Use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh. An open source alternative is scrapoxy, a super proxy that you can attach your own proxies to.
- Use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera.

If you are still unable to prevent your bot getting banned, consider contacting commercial support.
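
The first three tips (user agent rotation, disabling cookies, download delays) can be sketched in a Scrapy project roughly as follows. This is a minimal sketch, not the official recipe: COOKIES_ENABLED, DOWNLOAD_DELAY and DOWNLOADER_MIDDLEWARES are real Scrapy settings, while USER_AGENT_POOL, the RotateUserAgentMiddleware class and the myproject.middlewares path are hypothetical names introduced here for illustration. The user agent strings are only sample values. For brevity the settings and the middleware are shown in one block; in a real project they would live in settings.py and middlewares.py.

```python
import random

# --- settings.py (values mirror the tips above) ---
COOKIES_ENABLED = False  # do not store or send cookies
DOWNLOAD_DELAY = 2       # seconds between consecutive requests to the same site

# Hypothetical custom setting (not built into Scrapy): sample user agents.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

# Hypothetical module path. Priority 400 is assumed so that this middleware
# runs before the built-in UserAgentMiddleware (500), which only fills in a
# default when no User-Agent header is set.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotateUserAgentMiddleware": 400,
}


# --- middlewares.py ---
class RotateUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("USER_AGENT_POOL must not be empty")
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the middleware from project settings.
        return cls(crawler.settings.getlist("USER_AGENT_POOL"))

    def process_request(self, request, spider):
        # Overwrite the header with a randomly chosen entry from the pool.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let Scrapy continue processing the request
```

Picking the user agent per request, rather than once per crawl, is one possible design choice; a pool of rotating IPs would typically be handled separately, for example through a proxy service such as the ones named in the list.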

