Google scraping using python - requests: How to avoid being blocked due to many requests?

Problem Description

For a school project I need to get the web addresses of 200 companies (based on a list). My script works fine, but around company 80 I get blocked by Google. This is the message I'm getting:

> Our systems have detected unusual traffic from your computer network.
> This page checks to see if it's really you sending the requests, and
> not a robot.

I tried two different ways to get my data:

A simple one:

import requests
from bs4 import BeautifulSoup

for company_name in data:
    search = company_name
    results = 1
    page = requests.get("https://www.google.com/search?q={}&num={}".format(search, results))

    soup = BeautifulSoup(page.content, "html5lib")

And a more complex one:

import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

for company_name in data:
    search = company_name
    results = 1

    s = requests.Session()
    retries = Retry(total=3, backoff_factor=0.5)
    s.mount('http://', HTTPAdapter(max_retries=retries))
    s.mount('https://', HTTPAdapter(max_retries=retries))
    page = s.get("https://www.google.com/search?q={}&num={}".format(search, results))
    #time.sleep(.600)

    soup = BeautifulSoup(page.content, "html5lib")
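
A side note not in the original question: `requests` can build and URL-encode the query string for you via its `params` argument, which avoids malformed URLs when a company name contains spaces or special characters. A minimal sketch of the equivalent call:

import requests
from bs4 import BeautifulSoup

# Same request, but letting requests URL-encode the query parameters.
page = requests.get(
    "https://www.google.com/search",
    params={"q": company_name, "num": 1},
)
soup = BeautifulSoup(page.content, "html5lib")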

But I'm getting the same error over and over. Is there a way I could overcome this issue? Thanks!

Recommended Answer

If you just want to make sure you never make more than 1 request every 0.6 seconds, you just need to sleep until it's been at least 0.6 seconds since the last request.

If the amount of time it takes you to process each request is a tiny fraction of 0.6 seconds, you can uncomment the line already in your code. However, it probably makes more sense to do it at the end of the loop, rather than in the middle:

for company_name in data:
    # blah blah
    page = s.get("https://www.google.com/search?q={}&num={}".format(search, results))
    soup = BeautifulSoup(page.content, "html5lib")
    # do whatever you wanted with soup
    time.sleep(.600)

---

If your processing takes a sizable fraction of 0.6 seconds, then waiting 0.6 seconds is too long. For example, if it sometimes takes 0.1 seconds, sometimes 1.0, then you want to wait 0.5 seconds in the first case, but not at all in the second, right?

In that case, just keep track of the last time you made a request, and sleep until 0.6 seconds after that:

last_req = time.time()  # timestamp of the previous request
for company_name in data:
    # blah blah
    page = s.get("https://www.google.com/search?q={}&num={}".format(search, results))
    soup = BeautifulSoup(page.content, "html5lib")
    # do whatever you wanted with soup

    # Sleep only for whatever is left of the 0.6-second window.
    now = time.time()
    delay = last_req + 0.600 - now
    last_req = now
    if delay >= 0:
        time.sleep(delay)

---

If you need to make requests exactly once every 0.6 seconds—or as close to that as possible—you could kick off a thread that does that, and tosses the results in a queue, while another thread (possibly your main thread) just blocks popping requests off that queue and processing them.

But I can't imagine why you'd need that.
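
If you did want that pattern, here is a minimal sketch of it, assuming `data` is the list of company names from the question; the queue, the `fetch_all` name, and the `None` sentinel are illustrative choices, not part of the original answer:

import threading
import time
import queue

import requests
from bs4 import BeautifulSoup

results_q = queue.Queue()

def fetch_all(names):
    # Producer: one request every 0.6 seconds, results pushed onto the queue.
    for company_name in names:
        start = time.monotonic()
        page = requests.get(
            "https://www.google.com/search",
            params={"q": company_name, "num": 1},
        )
        results_q.put((company_name, page.content))
        # Sleep off whatever is left of the 0.6-second budget.
        remaining = 0.600 - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
    results_q.put(None)  # sentinel: no more results

threading.Thread(target=fetch_all, args=(data,), daemon=True).start()

# Consumer (main thread): block until each result arrives, then process it.
while True:
    item = results_q.get()
    if item is None:
        break
    company_name, content = item
    soup = BeautifulSoup(content, "html5lib")
    # do whatever you wanted with soup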
