gevent / requests hangs while making lots of HEAD requests


Question

I need to make 100k HEAD requests, and I'm using gevent on top of requests. My code runs for a while, but then eventually hangs. I'm not sure why it's hanging, or whether it's hanging inside requests or gevent. I'm using the timeout argument inside both requests and gevent.

Please take a look at my code snippet below, and let me know what I should change.

import gevent
from gevent import monkey, pool
monkey.patch_all()
import datetime
import requests

def get_head(url, timeout=3):
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout)
    except:
        return None

def expand_short_urls(short_urls, chunk_size=100, timeout=60*5):
    # Split the URL list into chunks of chunk_size.
    chunk_list = lambda l, n: (l[i:i+n] for i in range(0, len(l), n))
    p = pool.Pool(chunk_size)
    print 'Expanding %d short_urls' % len(short_urls)
    results = {}
    for i, _short_urls_chunked in enumerate(chunk_list(short_urls, chunk_size)):
        print '\t%d. processing %d urls @ %s' % (i, chunk_size, str(datetime.datetime.now()))
        jobs = [p.spawn(get_head, _short_url) for _short_url in _short_urls_chunked]
        gevent.joinall(jobs, timeout=timeout)
        # Keep only successful (HTTP 200) responses, mapping short URL -> expanded URL.
        for _short_url, job in zip(_short_urls_chunked, jobs):
            response = job.get()
            if response is not None and response.status_code == 200:
                results[_short_url] = response.url
    return results

I've tried grequests, but it's been abandoned, and I've gone through the GitHub pull requests, but they all have issues too.

Answer

The RAM usage you are observing mainly stems from all the data that piles up while storing 100,000 response objects, and from all the underlying overhead. I have reproduced your application case and fired off HEAD requests against 15,000 URLs from the top of the Alexa ranking. It did not really matter

  • whether I used a gevent pool (i.e. one greenlet per connection) or a fixed set of greenlets that each request multiple URLs
  • how large I defined the pool size

In the end, the RAM usage grew over time, to considerable amounts. However, I noticed that changing from requests to urllib2 already led to a reduction in RAM usage by about a factor of two. That is, I replaced

result = requests.head(url)

with

request = urllib2.Request(url)
request.get_method = lambda : 'HEAD'
result = urllib2.urlopen(request)

Some other advice: do not use two timeout mechanisms. Gevent's timeout approach is very solid, and you can easily use it like this:

import requests
from gevent import Timeout

def gethead(url):
    result = None
    try:
        with Timeout(5, False):
            result = requests.head(url)
    except Exception as e:
        result = e
    return result

This might look tricky, but it either returns None (after quite precisely 5 seconds, indicating a timeout), an exception object representing a communication error, or the response. Works great!
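
If you also want to apply the urllib2 suggestion from above, a minimal sketch combining it with the same timeout pattern could look like this (gethead_urllib2 is just an illustrative name, and treating every exception as a failure is an assumption made for brevity):

import urllib2
from gevent import Timeout

def gethead_urllib2(url):
    # Returns the response, an exception object, or None on timeout.
    result = None
    try:
        with Timeout(5, False):
            request = urllib2.Request(url)
            request.get_method = lambda: 'HEAD'
            result = urllib2.urlopen(request)
    except Exception as e:
        result = e
    return result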

Although this is likely not part of the issue, in such cases I recommend keeping workers alive and letting them work on multiple items each! The overhead of spawning greenlets is small, indeed. Still, this would be a very simple solution with a set of long-lived greenlets:

from gevent import spawn, joinall
from gevent.queue import Queue, Empty

POOLSIZE = 100  # e.g. 100 long-lived worker greenlets

def qworker(qin, qout):
    # Each worker keeps pulling URLs until the input queue is empty.
    while True:
        try:
            qout.put(gethead(qin.get(block=False)))
        except Empty:
            break

qin = Queue()
qout = Queue()

# urls is the full list of URLs to check
for url in urls:
    qin.put(url)

workers = [spawn(qworker, qin, qout) for i in xrange(POOLSIZE)]
joinall(workers)
returnvalues = [qout.get() for _ in xrange(len(urls))]

Also, you really need to appreciate that this is a large-scale problem you are tackling, and it yields non-standard issues. When I reproduced your scenario with a timeout of 20 s, 100 workers, and 15,000 URLs to be requested, I easily ended up with a large number of sockets:

# netstat -tpn | wc -l
10074

That is, the operating system had more than 10,000 sockets to manage, most of them in TIME_WAIT state. I also observed "Too many open files" errors and tuned the limits up via sysctl. When you request 100,000 URLs you will probably hit such limits too, and you need to come up with measures to prevent the system from starving.
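
As a rough, per-process illustration (a sketch assuming a Unix-like system; the system-wide ceilings still have to be raised via sysctl or ulimit as described above), the open-file limit can be inspected and raised from Python like this:

import resource

# Show the current limits and raise the soft limit up to the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print 'RLIMIT_NOFILE before: soft=%s hard=%s' % (soft, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))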

Also note the way you are using requests: it automatically follows redirects from HTTP to HTTPS and automatically verifies the certificate, all of which surely costs RAM.
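
If certificate validation is not essential for your use case (an assumption on my part; you do seem to need allow_redirects=True to expand the short URLs), you could skip it and reuse connections through a session. A sketch, with get_head_no_verify as an illustrative name, and with the caveat that sharing one session across many greenlets is something you should verify for your workload:

import requests

session = requests.Session()  # reuses TCP connections where possible

def get_head_no_verify(url, timeout=3):
    try:
        # verify=False skips TLS certificate validation; only do this if you
        # do not rely on the authenticity of the expanded URLs.
        return session.head(url, allow_redirects=True,
                            timeout=timeout, verify=False)
    except Exception:
        return None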

In my measurements, when I divided the number of requested URLs by the runtime of the program, I almost never passed 100 responses/s, which is the result of the high-latency connections to foreign servers all over the world. I guess you are also affected by such a limit. Adjust the rest of the architecture to this limit, and you will probably be able to generate a data stream from the Internet to disk (or a database) without accumulating too much in RAM in between.
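
A minimal sketch of that idea, reusing gethead, Empty, spawn, joinall, qin and POOLSIZE from the queue-based snippet above (the file name, the tab-separated format and the choice to keep only the final URL of 200 responses are my assumptions):

def qworker_to_disk(qin, outfile):
    # Write one line per URL as soon as its result is known, instead of
    # accumulating 100,000 response objects in memory.
    while True:
        try:
            url = qin.get(block=False)
        except Empty:
            break
        response = gethead(url)
        if getattr(response, 'status_code', None) == 200:
            outfile.write('%s\t%s\n' % (url, response.url))

with open('expanded_urls.tsv', 'w') as outfile:
    workers = [spawn(qworker_to_disk, qin, outfile) for _ in xrange(POOLSIZE)]
    joinall(workers)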

I should address your two main questions, specifically:

I think gevent, and the way you are using it, is not your problem. I think you are just underestimating the complexity of your task: it comes with nasty problems and drives your system to its limits.

  • Your RAM usage issue: start off by using urllib2, if you can. Then, if things still accumulate too much, you need to work against accumulation. Try to produce a steady state: you might want to start writing data off to disk and generally work towards a situation where objects can become garbage collected.

  • Your code "eventually hangs": probably this is related to your RAM issue. If it is not, then do not spawn so many greenlets, but reuse them as indicated. Also, further reduce concurrency, monitor the number of open sockets (see the sketch below), increase system limits if necessary, and try to find out exactly where your software hangs.
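
One cheap way to watch the number of open sockets from inside the program is to count the process's file descriptors; a sketch assuming Linux, where each open socket shows up under /proc/self/fd (netstat, as shown above, gives the system-wide picture):

import os

def open_fd_count():
    # Number of file descriptors (sockets included) held by this process.
    return len(os.listdir('/proc/self/fd'))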
