Scraping in Python - Preventing IP ban
Problem description
I am using Python to scrape pages. So far I haven't had any complicated issues.
The site that I'm trying to scrape uses a lot of security checks and has some mechanism to prevent scraping.
Using Requests and lxml I was able to scrape about 100-150 pages before getting banned by IP. Sometimes I even got banned on the first request (a new IP, not used before, in a different C block). I have tried spoofing headers and randomizing the time between requests; the result is still the same.
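A minimal sketch of the Requests + lxml approach described above, with spoofed headers and a randomized delay between requests. The URL list and the XPath expression are hypothetical stand-ins, since the actual site is not named.

```python
import random
import time

import requests
from lxml import html

# Hypothetical URL list -- the actual site is not named in the question.
urls = [f"https://example.com/page/{i}" for i in range(1, 151)]

session = requests.Session()
session.headers.update({
    # Spoofed headers mimicking a common desktop browser.
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
})

for url in urls:
    response = session.get(url, timeout=30)
    tree = html.fromstring(response.content)
    # Placeholder extraction; the real XPath depends on the target pages.
    print(url, tree.xpath("//title/text()"))
    # Randomized pause between requests (the exact range is an assumption).
    time.sleep(random.uniform(3, 5))
```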
I have tried Selenium and got much better results. With Selenium I was able to scrape about 600-650 pages before getting banned. Here I also randomized the delay between requests (between 3-5 seconds) and made a time.sleep(300) call on every 300th request. Despite that, I'm still getting banned.
From this I can conclude that the site has some mechanism that bans an IP if it requests more than X pages in one open browser session, or something like that.
Based on your experience, what else should I try? Would closing and reopening the browser in Selenium help (for example, closing and reopening the browser after every 100th request)? I was thinking about trying proxies, but there are about a million pages, so it would be very expensive.
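For reference, the "close and reopen the browser" idea could look like the sketch below: restart the Selenium session every 100 requests so that cookies and other per-session state are discarded. The URL list is again a hypothetical stand-in.

```python
import random
import time

from selenium import webdriver

# Hypothetical URL list -- the actual site is not named in the question.
urls = [f"https://example.com/page/{i}" for i in range(1, 651)]

driver = webdriver.Firefox()
try:
    for count, url in enumerate(urls, start=1):
        # Restart the browser every 100 requests so cookies and other
        # per-session state are discarded before the next batch.
        if count > 1 and count % 100 == 1:
            driver.quit()
            driver = webdriver.Firefox()
        driver.get(url)
        print(count, driver.title)
        time.sleep(random.uniform(3, 5))
finally:
    driver.quit()
```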
Recommended answer
If you would switch to the Scrapy web-scraping framework, you would be able to reuse a number of things that were made to prevent and tackle banning:
- AutoThrottle extension: an extension for automatically throttling the crawling speed based on the load of both the Scrapy server and the website you are crawling (see the settings sketch after this list).
- Rotating user agents with the scrapy-fake-useragent middleware: use a random User-Agent provided by fake-useragent on every request.
- Rotating IP addresses: you can also run it via a local proxy & TOR (the spider sketch after this list shows per-request proxying).
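A sketch of how these pieces might be wired together. The AutoThrottle options below are standard Scrapy settings, and the middleware path follows the scrapy-fake-useragent README; the project name is a placeholder.

```python
# settings.py -- project name is a placeholder.
BOT_NAME = "myscraper"

# Base delay plus AutoThrottle, which adapts the delay to server load.
DOWNLOAD_DELAY = 3
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

DOWNLOADER_MIDDLEWARES = {
    # Disable the stock User-Agent middleware and let scrapy-fake-useragent
    # pick a random User-Agent for every request.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
}
```

For the local proxy & TOR route, one common setup is TOR exposing a SOCKS port with an HTTP proxy such as Privoxy in front of it (port 8118 below is Privoxy's default, assumed here to be already running); Scrapy's built-in HttpProxyMiddleware then honours a per-request "proxy" meta key:

```python
# spider.py -- a hypothetical spider routing requests through a local
# proxy (e.g. Privoxy forwarding to TOR); the address is an assumption.
import scrapy


class PageSpider(scrapy.Spider):
    name = "pages"
    start_urls = ["https://example.com/page/1"]  # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            # Scrapy's built-in HttpProxyMiddleware honours this meta key.
            yield scrapy.Request(url, meta={"proxy": "http://127.0.0.1:8118"})

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```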