Bypass rate limit for requests.get


Problem description

I want to constantly scrape a website - once every 3-5 seconds - with

requests.get('http://www.example.com', headers=headers2, timeout=35).json()

But the example website has a rate limit and I want to bypass that. How can I do so? I thought about doing it with proxies, but was hoping there were other ways.
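For reference, a minimal sketch of the polling loop being described might look like the following. The URL comes from the question; the headers are hypothetical since the question does not show them, and example.com stands in for the real target (it will not actually return JSON).

import time
import requests

headers2 = {'User-Agent': 'Mozilla/5.0'}  # hypothetical; the question does not show the real headers

while True:
    # Fetch and decode the JSON payload, as in the question
    data = requests.get('http://www.example.com', headers=headers2, timeout=35).json()
    # ... process data here ...
    time.sleep(4)  # wait roughly 3-5 seconds between polls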

Recommended answer

You would have to do some very low-level work, likely using socket and urllib2.

First, do your research: how are they limiting your query rate? Is it by IP, is it session-based (a server-side cookie), or is it via local cookies? As a first step, I suggest visiting the site manually and using a web developer tool to view all the headers being communicated.
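You can do part of that research from Python as well by dumping the response headers. A rough sketch, assuming the question's placeholder URL; the Retry-After and X-RateLimit-Remaining names are just common examples, and the headers a given site sends will vary.

import requests

resp = requests.get('http://www.example.com', timeout=35)

# Dump every response header; look for Set-Cookie and any rate-limit hints
for name, value in resp.headers.items():
    print(name + ': ' + value)

# Some sites advertise their limits in headers like these (names vary per site)
print(resp.headers.get('Retry-After'))
print(resp.headers.get('X-RateLimit-Remaining'))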

Once you figure this out, create a plan to manipulate it. Let's say it is session-based: you could use multiple threads to control several individual instances of a scraper, each with a unique session.
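A minimal sketch of that idea, assuming the limit really is tied to a server-side session cookie: each worker gets its own requests.Session and therefore its own cookie jar, so the server sees several independent sessions. The URL, worker count, and timing are all placeholders.

import threading
import time
import requests

def scrape_worker(worker_id):
    # A separate Session per worker means a separate cookie jar,
    # i.e. a separate server-side session
    session = requests.Session()
    for _ in range(5):
        resp = session.get('http://www.example.com', timeout=35)
        print(worker_id, resp.status_code)
        time.sleep(4)  # keep each individual session under the rate limit

threads = [threading.Thread(target=scrape_worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()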

Now, if it is IP-based, then you must spoof your IP, which is much more complex.
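In practice, the usual workaround for an IP-based limit is the one the question anticipates: routing requests through different proxies rather than literally spoofing packets. With requests that looks roughly like the sketch below; the proxy addresses are hypothetical and you would substitute endpoints you actually control.

import itertools
import requests

# Hypothetical proxy pool - substitute the addresses of proxies you control
proxy_pool = itertools.cycle([
    {'http': 'http://proxy1.example.net:8080'},
    {'http': 'http://proxy2.example.net:8080'},
])

for _ in range(10):
    # Each request goes out through the next proxy in the rotation,
    # so the target sees the traffic split across several IPs
    resp = requests.get('http://www.example.com',
                        proxies=next(proxy_pool), timeout=35)
    print(resp.status_code)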
