Python urllib2.urlopen() is slow, need a better way to read several urls


Question


As the title suggests, I'm working on a site written in python and it makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.


As I have to read 5-10 sites, the page takes a while to load.


I'm just wondering if there's a way to read the sites all at once? Or any tricks to make it faster, like should I close the urllib2.urlopen after each read, or keep it open?


Added: also, if I were to just switch over to PHP, would that be faster for fetching and parsing HTML and XML files from other sites? I just want it to load faster, as opposed to the ~20 seconds it currently takes.

Answer


I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.

import threading, urllib2
import Queue

urls_to_load = [
    'http://stackoverflow.com/',
    'http://slashdot.org/',
    'http://www.archive.org/',
    'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    # Fetch one URL and push the body onto the shared queue.
    data = urllib2.urlopen(url).read()
    print('Fetched %s bytes from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    # One thread per URL; Queue.Queue is thread-safe, so the workers
    # can put() results concurrently without extra locking.
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result))
               for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every fetch has finished
    return result

def fetch_sequencial():
    # Baseline for comparison: same work, one URL after another.
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url, result)
    return result
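Note that fetch_parallel() hands back the Queue itself, which the caller still has to drain. A minimal sketch of that last step (fetch_and_drain is a hypothetical helper, and its read argument stands in for urllib2.urlopen(url).read() so the snippet runs without network access; results come back in completion order, not request order):

```python
try:
    import Queue as queue_mod   # Python 2, as in the answer's code
except ImportError:
    import queue as queue_mod   # Python 3
import threading

def fetch_and_drain(urls, read=lambda url: url.upper()):
    # `read` is an offline stand-in for urllib2.urlopen(url).read();
    # pass a real reader for live fetches.
    result = queue_mod.Queue()

    def worker(url):
        result.put(read(url))

    threads = [threading.Thread(target=worker, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All workers are done, so empty() is now reliable: drain the queue.
    pages = []
    while not result.empty():
        pages.append(result.get())
    return pages
```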


Best time for fetch_sequencial() is 2s. Best time for fetch_parallel() is 0.9s.


Also, it is incorrect to say threads are useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because the threads are blocked on I/O. As you can see in my results, the parallel case is 2 times faster.
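On Python 3, urllib2 and Queue became urllib.request and queue, and the hand-rolled thread/queue bookkeeping above is usually replaced by concurrent.futures. A sketch under that assumption (the fetch parameter is added here so the example can run without network access; omit it and the urlopen default does real fetches):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch=None):
    # Default reader uses the stdlib; injectable so the pattern
    # can be exercised offline.
    if fetch is None:
        from urllib.request import urlopen
        fetch = lambda url: urlopen(url).read()
    # map() keeps results in input order and re-raises any worker
    # exception when the results are consumed.
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(fetch, urls))
```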

