Python urllib2.urlopen() is slow, need a better way to read several urls
Question
As the title suggests, I'm working on a site written in Python that makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.
As I have to read 5-10 sites, the page takes a while to load.
I'm just wondering if there's a way to read the sites all at once? Or any tricks to make it faster, like should I close the urllib2.urlopen after each read, or keep it open?

Added: also, if I were to just switch over to PHP, would that be faster for fetching and parsing HTML and XML files from other sites? I just want it to load faster, as opposed to the ~20 seconds it currently takes.
Answer
I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.
import threading, urllib2
import Queue

urls_to_load = [
    'http://stackoverflow.com/',
    'http://slashdot.org/',
    'http://www.archive.org/',
    'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    # Fetch the page and push the body onto the shared queue.
    data = urllib2.urlopen(url).read()
    print('Fetched %s bytes from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result))
               for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def fetch_sequential():
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url, result)
    return result
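Note that fetch_parallel() hands back a Queue.Queue rather than a list, so pages come out in completion order, not request order. A minimal sketch of draining it after the threads have joined, using a pre-filled queue as a stand-in for fetched pages (the drain helper and sample pages are illustrative, not part of the original answer):

```python
try:
    import Queue as queue  # Python 2 name, as in the answer's code
except ImportError:
    import queue  # renamed to queue in Python 3

def drain(q):
    # Pull every item currently in the queue into a list.
    # Safe here because all worker threads have already joined,
    # so no producer is still adding items.
    items = []
    while not q.empty():
        items.append(q.get())
    return items

result = queue.Queue()
for page in ['<html>one</html>', '<html>two</html>']:
    result.put(page)

pages = drain(result)
print('Fetched %d pages' % len(pages))
```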
Best time for fetch_sequential() is 2s. Best time for fetch_parallel() is 0.9s.
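On Python 3, where urllib2 became urllib.request and Queue became queue, the same pattern is usually written with concurrent.futures instead of managing threads by hand. A sketch under that assumption; the URL list, worker count, and the injectable fetcher parameter are placeholders of my own, not part of the original answer:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls_to_load = [
    'http://stackoverflow.com/',
    'http://slashdot.org/',
]

def read_url(url):
    # Each worker thread blocks on network I/O here, releasing the GIL,
    # so the downloads overlap.
    with urlopen(url) as resp:
        return resp.read()

def fetch_parallel(urls, fetcher=read_url, max_workers=10):
    # Unlike the Queue version above, map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetcher, urls))
```

Accepting fetcher as a parameter also makes the function easy to exercise without the network; fetch_parallel(urls_to_load) uses urlopen by default.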
Also, it is incorrect to say threading is useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because the threads are blocked on I/O. As you can see in my results, the parallel case is 2 times faster.