tornado: AsyncHttpClient.fetch from an iterator?

Problem description

I'm trying to write a web crawler thing and want to make HTTP requests as quickly as possible. tornado's AsyncHttpClient seems like a good choice, but all the example code I've seen (e.g. http://stackoverflow.com/a/25549675/1650177) basically calls AsyncHttpClient.fetch on a huge list of URLs, letting tornado queue them up and eventually make the requests.

But what if I want to process an indefinitely long (or just a really big) list of URLs from a file or the network? I don't want to load all the URLs into memory.

I've Googled around but can't seem to find a way to call AsyncHttpClient.fetch from an iterator. I did, however, find a way to do what I want using gevent: http://gevent.org/gevent.threadpool.html#gevent.threadpool.ThreadPool.imap. Is there a way to do something similar in tornado?
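
For reference, the gevent version I had in mind looks roughly like this (a minimal sketch; fetch_body and urls_from_file are my own helper names, and it assumes the requests library is available, though any blocking HTTP call would do):

from gevent.threadpool import ThreadPool
import requests  # assumption: any blocking HTTP client works here

def urls_from_file(path):
    # A generator, so the whole URL list never has to sit in memory.
    with open(path) as f:
        for line in f:
            yield line.strip()

def fetch_body(url):
    return requests.get(url).content

pool = ThreadPool(20)
# imap applies fetch_body to the URLs using the pool's worker threads.
for body in pool.imap(fetch_body, urls_from_file('urls.txt')):
    print('got %d bytes' % len(body))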

One solution I've thought of is to only queue up so many URLs initially, then add logic to queue up more as fetch operations complete, but I'm hoping there's a cleaner solution.
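
Something along those lines might look like this with Tornado coroutines (a rough sketch only; MAX_IN_FLIGHT, urls_from_file, and the shared-iterator trick are illustrative, not an established Tornado recipe):

from tornado import gen
from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop

MAX_IN_FLIGHT = 10

def urls_from_file(path):
    # A generator, so URLs are read from the file only as they are needed.
    with open(path) as f:
        for line in f:
            yield line.strip()

@gen.coroutine
def crawl(url_iter):
    http_client = AsyncHTTPClient()

    @gen.coroutine
    def worker():
        # Each worker takes the next URL only after its previous fetch
        # finishes, so at most MAX_IN_FLIGHT requests are in flight and
        # the iterator is consumed on demand.
        for url in url_iter:
            try:
                yield http_client.fetch(url)
                print('got response from', url)
            except Exception:
                print('failed to fetch', url)

    # Yielding a list of futures waits for all workers to finish.
    yield [worker() for _ in range(MAX_IN_FLIGHT)]

if __name__ == '__main__':
    IOLoop.current().run_sync(lambda: crawl(urls_from_file('urls.txt')))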

Any help or recommendations would be appreciated!

Recommended answer

I would do this with a Queue and multiple workers, in a variation on https://github.com/tornadoweb/tornado/blob/master/demos/webspider/webspider.py. Because the Queue is bounded, yield q.put(url) waits whenever QUEUE_SIZE URLs are already pending, so the file is read only as fast as the workers drain the queue and the whole URL list never has to be in memory.

import tornado.queues
from tornado import gen
from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop

NUM_WORKERS = 10
QUEUE_SIZE = 100
# A bounded queue: yield q.put() waits once QUEUE_SIZE URLs are already pending.
q = tornado.queues.Queue(QUEUE_SIZE)
# Let the HTTP client make as many simultaneous requests as there are workers.
AsyncHTTPClient.configure(None, max_clients=NUM_WORKERS)
http_client = AsyncHTTPClient()

@gen.coroutine
def worker():
    # Repeatedly take a URL off the queue and fetch it; with NUM_WORKERS of
    # these running, at most that many fetches are in flight at once.
    while True:
        url = yield q.get()
        try:
            response = yield http_client.fetch(url)
            print('got response from', url)
        except Exception:
            print('failed to fetch', url)
        finally:
            q.task_done()

@gen.coroutine
def main():
    for i in range(NUM_WORKERS):
        IOLoop.current().spawn_callback(worker)
    with open("urls.txt") as f:
        for line in f:
            url = line.strip()
            # When the queue fills up, stop here to wait instead
            # of reading more from the file.
            yield q.put(url)
    # Wait until every queued URL has been fetched (workers call task_done()).
    yield q.join()

if __name__ == '__main__':
    IOLoop.current().run_sync(main)
