Python Process blocked by urllib2


Question

I set up a process that reads a queue of incoming URLs to download, but when urllib2 opens a connection the system hangs.

import urllib2, multiprocessing
import settings  # assumed: project settings module providing DEBUG
from threading import Thread
from Queue import Queue
from multiprocessing import Queue as ProcessQueue, Process

def download(url):
    """Download a page from a url.
    url [str]: url to get.
    return [unicode]: page downloaded.
    """
    if settings.DEBUG:
        print u'Downloading %s' % url
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    # Naive charset extraction: yields the whole Content-Type value
    # when no "charset=" parameter is present
    encoding = response.headers['content-type'].split('charset=')[-1]
    content = unicode(response.read(), encoding)
    return content

def downloader(url_queue, page_queue):
    def _downloader(url_queue, page_queue):
        while True:
            try:
                url = url_queue.get()
                page_queue.put_nowait({'url': url, 'page': download(url)})
            except Exception:
                print u'Error downloading %s' % url
                raise  # bare raise preserves the original traceback
            finally:
                url_queue.task_done()

    ## Init internal worker threads
    internal_url_queue = Queue()
    internal_page_queue = Queue()  # note: results are never forwarded to the process-level page_queue
    for num in range(multiprocessing.cpu_count()):
        worker = Thread(target=_downloader, args=(internal_url_queue, internal_page_queue))
        worker.setDaemon(True)
        worker.start()

    # Feed the worker threads until the 'STOP' sentinel arrives
    for url in iter(url_queue.get, 'STOP'):
        internal_url_queue.put(url)

    # Wait until every queued url has been processed
    internal_url_queue.join()

# Init the queues
url_queue = ProcessQueue()
page_queue = ProcessQueue()

# Init the process
download_worker = Process(target=downloader, args=(url_queue, page_queue))
download_worker.start()

From another module I can add URLs, and when I want I can stop the process and wait for it to close.

import module

module.url_queue.put('http://foobar1')
module.url_queue.put('http://foobar2')
module.url_queue.put('http://foobar3')
module.url_queue.put('STOP')
module.download_worker.join()

The problem is that when I use urlopen (response = urllib2.urlopen(request)), everything remains blocked.

There is no problem if I call the download() function directly, or when I use only threads without a Process.

Answer

The issue here is not urllib2, but the use of the multiprocessing module. When using the multiprocessing module under Windows, you must not use code that runs immediately when your module is imported - instead, put things in the main module inside an if __name__ == '__main__' block. See the section "Safe importing of main module" in the multiprocessing documentation.

For your code, make the following change in the downloader module:

#....
def start():
    global download_worker
    download_worker = Process(target=downloader, args=(url_queue, page_queue))
    download_worker.start()

And in the main module:

import module
if __name__=='__main__':
    module.start()
    module.url_queue.put('http://foobar1')
    #....

Because you didn't do this, each time the subprocess started it would run the main code again and start another process, causing the hang.

