Scraping in Python - Preventing IP ban


Problem Description

I am using Python to scrape pages. So far I haven't run into any complicated issues.

The site I'm trying to scrape uses a lot of security checks and has some mechanism in place to prevent scraping.

Using Requests and lxml I was able to scrape about 100-150 pages before getting IP-banned. Sometimes I even get banned on the very first request (new IP, never used before, different C block). I have tried spoofing headers and randomizing the time between requests, but the result is the same.

I have tried Selenium and got much better results. With Selenium I was able to scrape about 600-650 pages before getting banned. Here I also randomized the requests (a 3-5 second delay between them, plus a time.sleep(300) call on every 300th request). Despite that, I'm still getting banned.
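A minimal sketch of that Selenium pacing scheme, with the random 3-5 second delay and the long pause on every 300th request; the URL list and browser choice are assumptions:

```python
import random
import time

from selenium import webdriver

# Hypothetical list of pages; substitute the real targets.
urls = ["https://example.com/page/%d" % i for i in range(1, 651)]

driver = webdriver.Firefox()

for i, url in enumerate(urls, start=1):
    driver.get(url)
    # ... extract data from driver.page_source here ...

    # Random 3-5 second pause between requests.
    time.sleep(random.uniform(3, 5))

    # Longer cool-down on every 300th request, as described above.
    if i % 300 == 0:
        time.sleep(300)

driver.quit()
```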

From this I can conclude that the site has some mechanism that bans an IP if it requests more than X pages in one open browser session, or something like that.

Based on your experience, what else should I try? Would closing and reopening the browser in Selenium help (for example, after every 100th request)? I was thinking about trying proxies, but there are about a million pages and it would be very expensive.
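One way to test the restart idea is to tear down the WebDriver and start a fresh browser, optionally behind a different proxy, on every 100th request. A sketch of that, with a hypothetical proxy pool and placeholder URLs:

```python
import random
import time

from selenium import webdriver

def make_driver(proxy=None):
    """Start a fresh Firefox session, optionally routed through a proxy."""
    options = webdriver.FirefoxOptions()
    if proxy:  # e.g. "10.0.0.1:8080" -- hypothetical proxy address
        host, port = proxy.split(":")
        options.set_preference("network.proxy.type", 1)  # manual proxy config
        options.set_preference("network.proxy.http", host)
        options.set_preference("network.proxy.http_port", int(port))
        options.set_preference("network.proxy.ssl", host)
        options.set_preference("network.proxy.ssl_port", int(port))
    return webdriver.Firefox(options=options)

urls = ["https://example.com/page/%d" % i for i in range(1, 1001)]
proxies = ["10.0.0.1:8080", "10.0.0.2:8080"]  # hypothetical pool

driver = make_driver(random.choice(proxies))
for i, url in enumerate(urls, start=1):
    driver.get(url)
    # ... extract data here ...
    time.sleep(random.uniform(3, 5))

    # Tear the session down and start fresh on every 100th request,
    # discarding cookies and switching proxies.
    if i % 100 == 0:
        driver.quit()
        driver = make_driver(random.choice(proxies))

driver.quit()
```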

Recommended Answer

If you switch to the Scrapy web-scraping framework, you will be able to reuse a number of features built to prevent and tackle banning:

One of them is the AutoThrottle extension, which automatically throttles the crawling speed based on the load of both the Scrapy server and the website you are crawling.
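A minimal sketch of enabling that extension in a project's settings.py; the setting names come from the Scrapy documentation, while the values are illustrative starting points rather than tuned recommendations:

```python
# settings.py -- enable AutoThrottle in a Scrapy project.

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60.0          # ceiling for delays under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per server
AUTOTHROTTLE_DEBUG = True              # log throttling stats for every response

# Related built-in protections worth combining with AutoThrottle:
DOWNLOAD_DELAY = 3               # baseline delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # wait 0.5x-1.5x of DOWNLOAD_DELAY
COOKIES_ENABLED = False          # some ban mechanisms track session cookies
```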
