Python Urllib UrlOpen Read


Question

Say I am retrieving a list of URLs from a server using Python's urllib2 library. I noticed that it takes about 5 seconds to get one page, so it would take a long time to finish all the pages I want to collect.

Thinking about those 5 seconds: most of the time is spent on the server side, so I am wondering whether I could just use the threading library. With, say, 5 threads in this case, the average time per page could drop dramatically, maybe to 1 or 2 seconds (though it might make the server a bit busy). How can I optimize the number of threads so that I get a decent speed without pushing the server too hard?
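For illustration, a minimal sketch of that idea in Python 2 (urllib2, as in the question) might look like the following; the `urls` list and the thread count of 5 are assumptions, and there is no error handling:

```python
# Sketch: split the URL list across a fixed number of threads.
# Each thread fetches its own interleaved slice of the list.
import threading
import urllib2

def fetch_slice(url_slice, results):
    for url in url_slice:
        # each thread writes to distinct keys, so the shared dict is fine here
        results[url] = urllib2.urlopen(url).read()

def fetch_all(urls, num_threads=5):
    results, threads = {}, []
    for i in range(num_threads):
        t = threading.Thread(target=fetch_slice,
                             args=(urls[i::num_threads], results))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results
```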

Thanks!

Update: I increased the number of threads one by one and monitored the total time (in minutes) spent scraping 100 URLs. The total time dropped dramatically when the number of threads was changed to 2 and kept decreasing as the number of threads grew, but the improvement from threading became less and less noticeable (the total time even bounced back when too many threads were used). I know this is only one specific case for the web server I harvest, but I decided to share it to show the power of threading, in the hope that it helps somebody one day.

Answer

There are a few things you can do. If the URLs are on different domains, then you might just fan out the work to threads, each downloading a page from a different domain.
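A hedged sketch of that per-domain fan-out (again Python 2 to match the question; the helper names and the lack of error handling are my own):

```python
# Sketch: group URLs by domain and give each domain its own thread.
import threading
import urllib2
from urlparse import urlparse
from collections import defaultdict

def fetch_domain(domain_urls, results):
    for url in domain_urls:
        results[url] = urllib2.urlopen(url).read()

def fetch_by_domain(urls):
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[urlparse(url).netloc].append(url)

    results, threads = {}, []
    for domain_urls in by_domain.values():
        t = threading.Thread(target=fetch_domain, args=(domain_urls, results))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results
```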

If your URLs all point to the same server and you do not want to stress the server, then you can just retrieve the URLs sequentially. If the server is happy with a couple of parallel requests, you can look into pools of workers. You could start, say, a pool of four workers and add all your URLs to a queue, from which the workers will pull new URLs.
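A rough sketch of such a worker pool with Python 2's Queue module (the pool size of four follows the suggestion above; everything else, including the function names, is assumed):

```python
# Sketch: a fixed pool of worker threads pulling URLs from a shared queue.
import threading
import urllib2
from Queue import Queue

def worker(q, results):
    while True:
        url = q.get()
        try:
            results[url] = urllib2.urlopen(url).read()
        finally:
            q.task_done()

def fetch_with_pool(urls, num_workers=4):
    q, results = Queue(), {}
    for _ in range(num_workers):
        t = threading.Thread(target=worker, args=(q, results))
        t.daemon = True          # let the program exit once the queue is drained
        t.start()
    for url in urls:
        q.put(url)
    q.join()                     # block until every URL has been processed
    return results
```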

Since you also tagged the question "screen-scraping": scrapy is a dedicated scraping framework that can work in parallel.
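Purely as an illustration (the spider name, the start_urls, and the concurrency settings below are placeholders, not part of the original question), a minimal Scrapy spider that crawls pages in parallel could look like this, run with `scrapy runspider page_spider.py -o pages.json`:

```python
import scrapy

class PageSpider(scrapy.Spider):
    name = "pages"
    # placeholder URLs; replace with the list you want to harvest
    start_urls = ["http://example.com/page/1", "http://example.com/page/2"]
    # keep parallelism modest so the target server is not pushed too hard
    custom_settings = {"CONCURRENT_REQUESTS": 4, "DOWNLOAD_DELAY": 0.5}

    def parse(self, response):
        # yield whatever you want to extract from each page
        yield {"url": response.url, "title": response.css("title::text").get()}
```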

Python 3 comes with a set of new built-in concurrency primitives under concurrent.futures.
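A small sketch with concurrent.futures (Python 3; the `urls` list and the worker count are assumptions, and errors simply propagate out of future.result()):

```python
# Sketch: fetch a list of URLs with a ThreadPoolExecutor.
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def fetch(url):
    with urlopen(url) as response:
        return response.read()

def fetch_all(urls, max_workers=4):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```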
