Python urllib2.urlopen() is slow, need a better way to read several urls

Question

As the title suggests, I'm working on a site written in Python that makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.

As I have to read 5-10 sites, the page takes a while to load.

I'm just wondering if there's a way to read the sites all at once? Or any tricks to make it faster, like should I close the urllib2.urlopen after each read, or keep it open?
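(For reference, "closing after each read" would look like the following. This is a minimal Python 2 sketch using contextlib.closing, since the object returned by urlopen is not a context manager in Python 2; the URL is just an illustration.)

from contextlib import closing
import urllib2

# Read one page and close the response as soon as the body is consumed.
with closing(urllib2.urlopen('http://stackoverflow.com/')) as response:
    html = response.read()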

Added: also, if I were to just switch over to PHP, would that be faster for fetching and parsing HTML and XML files from other sites? I just want it to load faster, as opposed to the ~20 seconds it currently takes.

Answer

I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.

import threading, urllib2
import Queue

urls_to_load = [
    'http://stackoverflow.com/',
    'http://slashdot.org/',
    'http://www.archive.org/',
    'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    # Fetch one URL and push the response body onto the shared queue.
    data = urllib2.urlopen(url).read()
    print('Fetched %d bytes from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    # One thread per URL; Queue.Queue is thread-safe, so all the
    # workers can put() into it without any extra locking.
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result))
               for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def fetch_sequential():
    # Baseline for comparison: fetch the same URLs one after another.
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url, result)
    return result
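To consume the results, you can drain the returned queue once the threads have finished; for example, handing each page to BeautifulSoup as the asker does. This is a sketch under assumptions: the import below is the Python 2-era BeautifulSoup 3 package, and draining with empty()/get() is safe here only because fetch_parallel() joins all threads before returning.

from BeautifulSoup import BeautifulSoup  # assumes BeautifulSoup 3

pages = fetch_parallel()
while not pages.empty():
    soup = BeautifulSoup(pages.get())
    print(soup.title)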

Best time for fetch_sequential() is 2s. Best time for fetch_parallel() is 0.9s.
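(If you want to reproduce these numbers, a minimal timing harness like the following will do; it just wraps each function in time.time() calls, so network variance will dominate and you should take the best of several runs.)

import time

for fetch in (fetch_sequential, fetch_parallel):
    start = time.time()
    fetch()
    print('%s took %.1fs' % (fetch.__name__, time.time() - start))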

Also, it is incorrect to say threads are useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because the threads are blocked on I/O. As you can see in my results, the parallel case is 2 times faster.
