How to parallelize file downloads?

Views: 141 | Published: 2017/7/13 9:51:40 | Tags: python, python-3.x, download, subprocess, wget


Question

I can download files one at a time:

import urllib.request

urls = ['http://foo.com/bar.gz', 'http://foobar.com/barfoo.gz', 'http://bar.com/foo.gz']  # urlretrieve needs a URL scheme

for u in urls:
  urllib.request.urlretrieve(u)

I could try to run the downloads through subprocess like this:

import subprocess
import os

def parallelized_commandline(command, files, max_processes=2):
    processes = set()
    for name in files:
        processes.add(subprocess.Popen([command, name]))
        if len(processes) >= max_processes:
            os.wait()
            processes.difference_update(
                [p for p in processes if p.poll() is not None])

    #Check if all the child processes were closed
    for p in processes:
        if p.poll() is None:
            p.wait()

urls = ['http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.en.gz',
'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.cs.gz', 
'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.de.gz']

parallelized_commandline('wget', urls)

Is there any way to parallelize urlretrieve without using os.system or subprocess to cheat?

Given that I must resort to the "cheat" for now, is subprocess.Popen the right way to download the data?

When I use parallelized_commandline() above, wget runs multi-threaded but not multi-core. Is that normal? Is there a way to make it multi-core instead of multi-threaded?

Recommended answer

You could use a thread pool to download files in parallel:

#!/usr/bin/env python3
from multiprocessing.dummy import Pool # use threads for I/O bound tasks
from urllib.request import urlretrieve

urls = [...]
result = Pool(4).map(urlretrieve, urls) # download 4 files at a time
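One thing to keep in mind with the one-liner above: if any single download raises, map() re-raises that exception and the results of the other downloads are lost. The following is a minimal sketch, not part of the original answer, of the same thread-pool idea using concurrent.futures, where each URL's failure is captured separately. The names download_all and safe_fetch and the fetch parameter are hypothetical, introduced only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve


def download_all(urls, fetch=urlretrieve, max_workers=4):
    """Call fetch(url) for every URL using at most max_workers threads.

    Returns a list of (url, result, error) tuples in the same order as
    urls. error is None when the fetch succeeded.
    """
    def safe_fetch(url):
        try:
            return url, fetch(url), None
        except Exception as exc:  # one failed URL should not abort the rest
            return url, None, exc

    # The with-block waits for all submitted work before returning.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_fetch, urls))
```

Calling download_all(urls) with the default fetch behaves like the Pool(4).map(urlretrieve, urls) line, except that failed URLs come back as (url, None, exception) entries instead of stopping the whole batch.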

You could also download several files at once in a single thread using asyncio:

#!/usr/bin/env python3
# Uses async/await syntax; asyncio.run() requires Python 3.7+
import asyncio
import logging
import aiohttp # $ pip install aiohttp

async def download(url, session, semaphore, chunk_size=1<<15):
    async with semaphore: # limit number of concurrent downloads
        filename = url2filename(url)
        logging.info('downloading %s', filename)
        async with session.get(url) as response:
            with open(filename, 'wb') as file:
                while True: # save file
                    chunk = await response.content.read(chunk_size)
                    if not chunk:
                        break
                    file.write(chunk)
        logging.info('done %s', filename)
    return filename, (response.status, tuple(response.headers.items()))

async def main(urls):
    semaphore = asyncio.Semaphore(4)
    async with aiohttp.ClientSession() as session:
        download_tasks = (download(url, session, semaphore) for url in urls)
        return await asyncio.gather(*download_tasks)

urls = [...]
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
result = asyncio.run(main(urls))

where url2filename() is a separately defined helper that maps a URL to a local filename.
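The definition of url2filename() is not reproduced on this page. As a rough sketch of what such a helper might look like, assuming the filename is simply the last path component of the URL, something like the following could be used. This is an illustration under that assumption, not necessarily the answer author's implementation.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def url2filename(url):
    """Return the last path component of url as a local filename.

    Illustrative sketch only: it ignores query strings and
    Content-Disposition headers.
    """
    path = unquote(urlsplit(url).path)   # decode %xx escapes in the path
    filename = posixpath.basename(path)  # URL paths always use '/'
    # Reject empty names and names that could escape the current directory.
    if not filename or filename in ('.', '..') or '\\' in filename:
        raise ValueError('cannot derive a filename from %r' % url)
    return filename
```

With the URLs from the question, this would save the first file as news-commentary-v10.en.gz in the current directory. A URL that ends in a slash has no last component, so the sketch raises ValueError rather than guessing a name.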
