Limiting/throttling the rate of HTTP requests in GRequests
Question
I'm writing a small script in Python 2.7.3 with GRequests and lxml that will let me gather collectible card prices from various websites and compare them. The problem is that one of the websites limits the number of requests and sends back HTTP error 429 if I exceed it.
Is there a way to throttle requests in GRequests so that I don't exceed a number of requests per second that I specify? Also, how can I make GRequests retry after some time if an HTTP 429 occurs?
On a side note, their limit is ridiculously low: something like 8 requests per 15 seconds. I have breached it with my browser on multiple occasions just by refreshing the page while waiting for price changes.
Recommended answer
I'm going to answer my own question, since I had to figure this out by myself and there seems to be very little information about it around.
The idea is as follows. Every request object used with GRequests can take a session object as a parameter when it is created. Session objects, in turn, can have HTTP adapters mounted that are used when making requests. By creating our own adapter we can intercept requests and rate-limit them in whatever way suits our application best. In my case I ended up with the code below.
The object used for throttling:
import datetime

DEFAULT_BURST_WINDOW = datetime.timedelta(seconds=5)
DEFAULT_WAIT_WINDOW = datetime.timedelta(seconds=15)


class BurstThrottle(object):
    """Allow up to max_hits calls per burst window, then force a wait."""

    max_hits = None
    hits = None
    burst_window = None
    total_window = None
    timestamp = None

    def __init__(self, max_hits, burst_window, wait_window):
        self.max_hits = max_hits
        self.hits = 0
        self.burst_window = burst_window
        self.total_window = burst_window + wait_window
        self.timestamp = datetime.datetime.min

    def throttle(self):
        """Return how long the caller must wait; a zero timedelta means go ahead."""
        now = datetime.datetime.utcnow()
        if now < self.timestamp + self.total_window:
            if (now < self.timestamp + self.burst_window) and (self.hits < self.max_hits):
                # Still inside the burst window with hits to spare.
                self.hits += 1
                return datetime.timedelta(0)
            else:
                # Burst used up or expired: wait out the rest of the window.
                return self.timestamp + self.total_window - now
        else:
            # The previous window is over: start a new one with this hit.
            self.timestamp = now
            self.hits = 1
            return datetime.timedelta(0)
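To see what `BurstThrottle` does over time, here is a small simulation sketch. It repeats the same decision logic in a copy that takes an injectable clock, so the timings are deterministic. `ClockedBurstThrottle`, `simulate` and the limit of 3 hits are illustrative names and values, not part of the original script.

```python
import datetime


class ClockedBurstThrottle(object):
    """Same decision logic as BurstThrottle, but with an injectable clock."""

    def __init__(self, max_hits, burst_window, wait_window, clock):
        self.max_hits = max_hits
        self.hits = 0
        self.burst_window = burst_window
        self.total_window = burst_window + wait_window
        self.timestamp = None  # no window has been opened yet
        self.clock = clock

    def throttle(self):
        now = self.clock()
        if self.timestamp is not None and now < self.timestamp + self.total_window:
            if now < self.timestamp + self.burst_window and self.hits < self.max_hits:
                self.hits += 1
                return datetime.timedelta(0)
            # Burst used up or expired: report the time left in the window.
            return self.timestamp + self.total_window - now
        # First call, or the previous window is over: open a new window.
        self.timestamp = now
        self.hits = 1
        return datetime.timedelta(0)


def simulate(call_times, max_hits=3, burst_seconds=5, wait_seconds=15):
    """Return the wait (in seconds) reported for a call at each offset."""
    start = datetime.datetime(2020, 1, 1)
    current = [start]  # mutable cell so the lambda sees the updated time
    throttle = ClockedBurstThrottle(
        max_hits,
        datetime.timedelta(seconds=burst_seconds),
        datetime.timedelta(seconds=wait_seconds),
        clock=lambda: current[0],
    )
    waits = []
    for offset in call_times:
        current[0] = start + datetime.timedelta(seconds=offset)
        waits.append(throttle.throttle().total_seconds())
    return waits


print(simulate([0, 1, 2, 3, 20]))  # [0.0, 0.0, 0.0, 17.0, 0.0]
```

The first three calls pass immediately. The fourth is told to wait out the 17 seconds left in the 20-second window, and a call at 20 seconds opens a new window.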
The HTTP adapter:
import datetime

import gevent
import requests.adapters


class MyHttpAdapter(requests.adapters.HTTPAdapter):
    """HTTPAdapter that throttles outgoing requests and retries on HTTP 429."""

    throttle = None

    def __init__(self, pool_connections=requests.adapters.DEFAULT_POOLSIZE,
                 pool_maxsize=requests.adapters.DEFAULT_POOLSIZE,
                 max_retries=requests.adapters.DEFAULT_RETRIES,
                 pool_block=requests.adapters.DEFAULT_POOLBLOCK,
                 burst_window=DEFAULT_BURST_WINDOW,
                 wait_window=DEFAULT_WAIT_WINDOW):
        # The pool size doubles as the number of requests allowed per burst.
        self.throttle = BurstThrottle(pool_maxsize, burst_window, wait_window)
        super(MyHttpAdapter, self).__init__(pool_connections=pool_connections,
                                            pool_maxsize=pool_maxsize,
                                            max_retries=max_retries,
                                            pool_block=pool_block)

    def send(self, request, stream=False, timeout=None, verify=True, cert=None, proxies=None):
        request_successful = False
        response = None
        while not request_successful:
            # Sleep (yielding to other greenlets) until the throttle lets us through.
            wait_time = self.throttle.throttle()
            while wait_time > datetime.timedelta(0):
                gevent.sleep(wait_time.total_seconds(), ref=True)
                wait_time = self.throttle.throttle()

            response = super(MyHttpAdapter, self).send(request, stream=stream, timeout=timeout,
                                                       verify=verify, cert=cert, proxies=proxies)
            # On HTTP 429, loop again so the request goes back through the throttle.
            if response.status_code != 429:
                request_successful = True

        return response
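The adapter above retries a 429 as soon as the throttle allows it. Some servers also send a `Retry-After` header with a 429 response, given either as a number of seconds or as an HTTP date. Whether this particular site sends one is not known, so treat the following as an optional sketch. `retry_after_seconds` and its 15-second default are hypothetical, not part of the original answer.

```python
import email.utils
import time


def retry_after_seconds(header_value, default=15.0, now=None):
    """Convert a Retry-After header value into a non-negative delay in seconds.

    header_value: the raw header string, or None if the header is absent.
    default:      delay to use when the header is missing or unparseable
                  (15.0 is an assumed value, not taken from the original).
    now:          current Unix time; injectable to make the function testable.
    """
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "Retry-After: 120".
        return float(value)
    parsed = email.utils.parsedate_tz(value)
    if parsed is None:
        return default
    # HTTP-date form, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT".
    target = email.utils.mktime_tz(parsed)
    current = time.time() if now is None else now
    return max(0.0, float(target - current))


print(retry_after_seconds("120"))  # 120.0
print(retry_after_seconds(None))   # 15.0
```

Inside `send`, one could then call `gevent.sleep(retry_after_seconds(response.headers.get('Retry-After')))` after a 429 and before looping, if the server turns out to provide the header.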
Setup:
import datetime

import grequests
import requests

import adapter  # module containing MyHttpAdapter (name taken from the call below)

# __CONCURRENT_LIMIT__, handle_response and urls are defined elsewhere in the script.
requests_adapter = adapter.MyHttpAdapter(
    pool_connections=__CONCURRENT_LIMIT__,
    pool_maxsize=__CONCURRENT_LIMIT__,
    max_retries=0,
    pool_block=False,
    burst_window=datetime.timedelta(seconds=5),
    wait_window=datetime.timedelta(seconds=20))

requests_session = requests.session()
requests_session.mount('http://', requests_adapter)
requests_session.mount('https://', requests_adapter)

unsent_requests = (grequests.get(url,
                                 hooks={'response': handle_response},
                                 session=requests_session) for url in urls)
grequests.map(unsent_requests, size=__CONCURRENT_LIMIT__)
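When choosing the window values, one rough sanity check is the following. It assumes the server counts requests over a sliding window, which the site does not document. With `BurstThrottle`, the last hit of one burst and the first hit of the next are always more than `wait_window` apart. So if `wait_window` is at least as long as the server's window, any such window contains hits from a single burst only. The helper `fits_rate_limit` is a hypothetical name, and the check is sufficient but not necessary.

```python
import datetime


def fits_rate_limit(max_hits, wait_window, limit_hits, limit_window):
    """Sufficient (not necessary) check that burst settings respect a limit.

    max_hits / wait_window:    the BurstThrottle settings.
    limit_hits / limit_window: the server's limit, e.g. 8 hits per 15 seconds.
    """
    # Consecutive bursts are separated by more than wait_window, so a window of
    # length limit_window <= wait_window only ever sees hits from one burst.
    return max_hits <= limit_hits and wait_window >= limit_window


limit_window = datetime.timedelta(seconds=15)
# A burst size of 8 is an assumed value here; the original does not state
# what __CONCURRENT_LIMIT__ is set to.
print(fits_rate_limit(8, datetime.timedelta(seconds=20), 8, limit_window))  # True
print(fits_rate_limit(8, datetime.timedelta(seconds=10), 8, limit_window))  # False
```

By this check, the 20-second `wait_window` in the setup above leaves some margin over the roughly 15-second window mentioned in the question, as long as the burst size does not exceed 8.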