How can I speed up fetching pages with urllib2 in python?


Question


I have a script that fetches several web pages and parses the info.

(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )

I ran cProfile on it, and as I assumed, urlopen takes up a lot of time. Is there a way to fetch the pages faster? Or a way to fetch several pages at once? I'll do whatever is simplest, as I'm new to Python and web development.

Thanks in advance! :)

UPDATE: I have a function called fetchURLs(), which I use to make an array of the URLs I need, so something like urls = fetchURLs(). The URLs are all XML files from Amazon and eBay APIs (which confuses me as to why it takes so long to load; maybe my webhost is slow?).

What I need to do is load each URL, read each page, and send that data to another part of the script which will parse and display the data.

Note that I can't do the latter part until ALL of the pages have been fetched; that's what my issue is.

Also, my host limits me to 25 processes at a time, I believe, so whatever is easiest on the server would be nice :)


Here it is for time:

Sun Aug 15 20:51:22 2010    prof

         211352 function calls (209292 primitive calls) in 22.254 CPU seconds

   Ordered by: internal time
   List reduced from 404 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10   18.056    1.806   18.056    1.806 {_socket.getaddrinfo}
     4991    2.730    0.001    2.730    0.001 {method 'recv' of '_socket.socket' objects}
       10    0.490    0.049    0.490    0.049 {method 'connect' of '_socket.socket' objects}
     2415    0.079    0.000    0.079    0.000 {method 'translate' of 'unicode' objects}
       12    0.061    0.005    0.745    0.062 /usr/local/lib/python2.6/HTMLParser.py:132(goahead)
     3428    0.060    0.000    0.202    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1306(endData)
     1698    0.055    0.000    0.068    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1351(_smartPop)
     4125    0.053    0.000    0.056    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:118(setup)
     1698    0.042    0.000    0.358    0.000 /usr/local/lib/python2.6/HTMLParser.py:224(parse_starttag)
     1698    0.042    0.000    0.275    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1397(unknown_starttag)

Solution

EDIT: I'm expanding the answer to include a more polished example. I have found a lot of hostility and misinformation in this post regarding threading vs. async I/O. Therefore I am also adding more arguments to refute certain invalid claims. I hope this will help people choose the right tool for the right job.

This is a duplicate of a question asked 3 days ago:

Python urllib2.urlopen() is slow, need a better way to read several urls

I'm polishing the code to show how to fetch multiple web pages in parallel using threads.

import time
import threading
import Queue
import urllib2

# utility - spawn a thread to execute target for each args
def run_parallel_in_threads(target, args_list):
    result = Queue.Queue()
    # wrapper to collect return value in a Queue
    def task_wrapper(*args):
        result.put(target(*args))
    threads = [threading.Thread(target=task_wrapper, args=args) for args in args_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def dummy_task(n):
    for i in xrange(n):
        time.sleep(0.1)
    return n

# below is the application code
# (each URL is wrapped in a 1-tuple, since run_parallel_in_threads expects an args tuple per task)
urls = [
    ('http://www.google.com/',),
    ('http://www.lycos.com/',),
    ('http://www.bing.com/',),
    ('http://www.altavista.com/',),
    ('http://achewood.com/',),
]

def fetch(url):
    return urllib2.urlopen(url).read()

run_parallel_in_threads(fetch, urls)

As you can see, the application-specific code is only 3 lines, which can be collapsed into 1 line if you are aggressive. I don't think anyone can justify the claim that this is complex and unmaintainable.
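As a small usage sketch (not part of the original answer), this is one way the Queue returned by run_parallel_in_threads could be drained once all threads have joined; the name pages is made up for illustration, and results come back in completion order rather than in the order of urls:

# hypothetical follow-up: drain the Queue returned by run_parallel_in_threads
result = run_parallel_in_threads(fetch, urls)
pages = []                       # "pages" is an illustrative name, not from the original code
while not result.empty():        # safe here: every worker thread has already been joined
    pages.append(result.get())
# pages now holds one response body per URL (in completion order),
# ready to hand to the parsing part of the script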

Unfortunately, most of the other threading code posted here has some flaws. Many of the examples use active polling to wait for the code to finish; join() is a better way to synchronize the code. I think this code improves upon all the threading examples so far.

Keep-alive connections

WoLpH's suggestion about using a keep-alive connection could be very useful if all your URLs point to the same server.
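To illustrate the idea (a rough sketch only, not WoLpH's actual code; the host is taken from the question and the second path is a placeholder), Python 2's httplib can reuse one TCP connection for consecutive requests to the same host:

import httplib

# one TCP connection, reused for several requests to the same host
# (assumes the server honours HTTP/1.1 keep-alive and does not close the connection)
conn = httplib.HTTPConnection('bluedevilbooks.com')
for path in ('/search/?DEPT=MATH&CLASS=103&SEC=01',    # path from the question
             '/search/?DEPT=MATH&CLASS=103&SEC=02'):   # placeholder path
    conn.request('GET', path)
    response = conn.getresponse()
    data = response.read()    # read the response fully before reusing the connection
conn.close()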

Twisted

Aaron Gallagher is a fan of the Twisted framework and is hostile to anyone who suggests threads. Unfortunately, many of his claims are misinformation. For example, he said "-1 for suggesting threads. This is IO-bound; threads are useless here." This is contrary to the evidence, since both Nick T and I have demonstrated a speed gain from using threads. In fact, I/O-bound applications have the most to gain from Python's threads (versus no gain for CPU-bound applications). Aaron's misguided criticism of threads shows that he is rather confused about parallel programming in general.

Right tool for the right job

I'm well aware of the issues that pertain to parallel programming with threads, Python, async I/O and so on. Each tool has its pros and cons, and for each situation there is an appropriate tool. I'm not against Twisted (though I have not deployed it myself). But I don't believe we can flatly say that threads are BAD and Twisted is GOOD in all situations.

For example, if the OP's requirement is to fetch 10,000 websites in parallel, async I/O will be preferable. Threading won't be appropriate (unless maybe with Stackless Python).

Aaron's opposition to threads consists mostly of generalizations. He fails to recognize that this is a trivial parallelization task: each task is independent and does not share resources. So most of his attacks do not apply.

Given that my code has no external dependencies, I'll call it the right tool for the right job.

Performance

I think most people would agree that the performance of this task depends largely on the networking code and the external server, and that the performance of the platform code should have a negligible effect. However, Aaron's benchmark shows a 50% speed gain over the threaded code. I think it is necessary to respond to this apparent speed gain.

In Nick's code, there is an obvious flaw that caused the inefficiency. But how do you explain the 233ms speed gain over my code? I think even Twisted fans will refrain from jumping to the conclusion that this is due to the efficiency of Twisted. There are, after all, a huge number of variables outside the system code, such as the remote server's performance, the network, caching, the differing implementations of urllib2 and the Twisted web client, and so on.

Just to make sure Python's threading does not incur a huge amount of inefficiency, I ran a quick benchmark spawning 5 threads and then 500 threads. I am quite comfortable saying that the overhead of spawning 5 threads is negligible and cannot explain the 233ms speed difference.

In [274]: %time run_parallel_in_threads(dummy_task, [(0,)]*5)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
Out[275]: <Queue.Queue instance at 0x038B2878>

In [276]: %time run_parallel_in_threads(dummy_task, [(0,)]*500)
CPU times: user 0.16 s, sys: 0.00 s, total: 0.16 s
Wall time: 0.16 s

In [278]: %time run_parallel_in_threads(dummy_task, [(10,)]*500)
CPU times: user 1.13 s, sys: 0.00 s, total: 1.13 s
Wall time: 1.13 s       <<<<<<<< This means 0.13s of overhead

Further testing of my parallel fetching shows huge variability in the response time across 17 runs. (Unfortunately I don't have Twisted installed to verify Aaron's code.)

0.75 s
0.38 s
0.59 s
0.38 s
0.62 s
1.50 s
0.49 s
0.36 s
0.95 s
0.43 s
0.61 s
0.81 s
0.46 s
1.21 s
2.87 s
1.04 s
1.72 s

My testing does not support Aaron's conclusion that threading is consistently slower than async I/O by a measurable margin. Given the number of variables involved, I have to say this is not a valid test to measure the systematic performance difference between async I/O and threading.
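For reference, a minimal sketch (reusing the fetch, urls and run_parallel_in_threads definitions above; this harness is assumed, not taken from the original answer) of how each of the wall-time samples listed earlier could be collected:

import time

start = time.time()
run_parallel_in_threads(fetch, urls)       # one parallel fetch of all the URLs
print '%.2f s' % (time.time() - start)     # one wall-clock sample per run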
