Python 2.6: parallel parsing with urllib2

Problem description

I'm currently retrieving and parsing pages from a website using urllib2. However, there are many of them (more than 1000), and processing them sequentially is painfully slow.
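
For reference, the sequential version being described might look something like this minimal sketch (the URL list and the parse_page() function are hypothetical placeholders for the real site and parser):

    import urllib2

    def parse_page(html):
        pass  # stand-in for the real parsing logic

    # hypothetical list of pages to fetch
    urls = ["http://example.com/page/%d" % i for i in xrange(1, 1001)]

    for url in urls:
        html = urllib2.urlopen(url, timeout=30).read()  # one request at a time
        parse_page(html)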

I was hoping there was a way to retrieve and parse pages in a parallel fashion. If that's a good idea, is it possible, and how do I do it?

Also, what are "reasonable" values for the number of pages to process in parallel (I wouldn't want to put too much strain on the server or get banned because I'm using too many connections)?

Thanks!

Answer

You can always use threads (i.e. run each download in a separate thread). For large numbers of pages this could hog too many resources, in which case I recommend you take a look at gevent, and specifically this example, which may be just what you need.
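
A minimal sketch of the thread-based approach, assuming a hypothetical parse_page() function and a fixed pool of worker threads fed from a Queue (the pool size of 10 is just a guess, tied to the question about reasonable limits):

    import threading
    import urllib2
    from Queue import Queue  # Python 2.x module name

    NUM_WORKERS = 10  # assumed concurrency; tune to what the server tolerates

    def parse_page(html):
        pass  # stand-in for the real parsing logic

    def worker(queue):
        while True:
            url = queue.get()
            try:
                html = urllib2.urlopen(url, timeout=30).read()
                parse_page(html)
            except urllib2.URLError:
                pass  # log or retry as appropriate
            finally:
                queue.task_done()

    def fetch_all(urls):
        queue = Queue()
        for _ in xrange(NUM_WORKERS):
            t = threading.Thread(target=worker, args=(queue,))
            t.daemon = True  # workers exit with the main program
            t.start()
        for url in urls:
            queue.put(url)
        queue.join()  # wait until every queued URL has been processed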

(from gevent.org: "gevent is a coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of the libevent event loop")
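
This is not the example the answer refers to, but a rough sketch of what a gevent version could look like, assuming gevent is installed: monkey-patching makes urllib2's sockets cooperative, and the Pool caps the number of simultaneous connections.

    from gevent import monkey
    monkey.patch_all()  # patch sockets as early as possible

    import urllib2
    from gevent.pool import Pool

    def fetch(url):
        try:
            return url, urllib2.urlopen(url, timeout=30).read()
        except urllib2.URLError:
            return url, None  # mark failures instead of crashing the pool

    def fetch_all(urls, concurrency=10):  # concurrency value is a guess
        pool = Pool(concurrency)
        return pool.map(fetch, urls)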
