How to parallelize file downloads?
Question
I can download files one at a time:
import urllib.request
urls = ['foo.com/bar.gz', 'foobar.com/barfoo.gz', 'bar.com/foo.gz']
for u in urls:
    urllib.request.urlretrieve(u)
I could try using subprocess, as such:
import subprocess
import os
def parallelized_commandline(command, files, max_processes=2):
    processes = set()
    for name in files:
        processes.add(subprocess.Popen([command, name]))
        if len(processes) >= max_processes:
            os.wait()
            processes.difference_update(
                [p for p in processes if p.poll() is not None])
    # Check if all the child processes were closed
    for p in processes:
        if p.poll() is None:
            p.wait()
urls = ['http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.en.gz',
'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.cs.gz',
'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.de.gz']
parallelized_commandline('wget', urls)
Is there any way to parallelize urlretrieve without using os.system or subprocess to cheat?
Given that I must resort to the "cheat" for now, is subprocess.Popen the right way to download the data?
When using the parallelized_commandline() above, wget runs multi-threaded but not multi-core. Is that normal? Is there a way to make it multi-core instead of multi-threaded?
Recommended answer
You could use a thread pool to download files in parallel:
#!/usr/bin/env python3
from multiprocessing.dummy import Pool # use threads for I/O bound tasks
from urllib.request import urlretrieve
urls = [...]
result = Pool(4).map(urlretrieve, urls) # download 4 files at a time
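With `map`, an exception raised by any single download propagates out of the whole call. If per-URL error handling is wanted, a variant built on the standard library's `concurrent.futures` might look like the sketch below. The name `download_all` and the `fetch` parameter are assumptions made for illustration, not part of the original answer; `fetch` defaults to `urlretrieve` and exists only so the function can be tried without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlretrieve


def download_all(urls, max_workers=4, fetch=urlretrieve):
    """Download urls concurrently; return (results, errors) dicts keyed by URL.

    `fetch` is a hypothetical injection point (defaults to urlretrieve).
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the URL it is downloading.
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed URL should not abort the rest
                errors[url] = exc
    return results, errors
```

Calling `results, errors = download_all(urls)` would then leave the successfully fetched URLs in `results` and the exceptions for the failed ones in `errors`.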
You could also download several files at once in a single thread using asyncio:
#!/usr/bin/env python3
import asyncio
import logging

import aiohttp  # $ pip install aiohttp


async def download(url, session, semaphore, chunk_size=1 << 15):
    async with semaphore:  # limit number of concurrent downloads
        filename = url2filename(url)
        logging.info('downloading %s', filename)
        async with session.get(url) as response:
            with open(filename, 'wb') as file:
                while True:  # save file
                    chunk = await response.content.read(chunk_size)
                    if not chunk:
                        break
                    file.write(chunk)
            logging.info('done %s', filename)
            return filename, (response.status, tuple(response.headers.items()))


async def main(urls):
    semaphore = asyncio.Semaphore(4)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(download(url, session, semaphore) for url in urls))


urls = [...]
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
result = asyncio.run(main(urls))
where url2filename() is a helper that maps a URL to a local filename; its definition is not included here.
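Since the definition of url2filename() is not reproduced in this text, a minimal stand-in could look like the following. This is a hypothetical sketch, not the definition the answer refers to: it assumes the filename is simply the percent-decoded last component of the URL path, and it raises ValueError when no safe name can be derived.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def url2filename(url):
    """Return the last path component of url as a local filename.

    Hypothetical sketch; raises ValueError when no safe name can be derived.
    """
    path = urlsplit(url).path  # drops the scheme, host, query and fragment
    name = posixpath.basename(unquote(path))
    # Reject empty names and names that could escape the current directory.
    if not name or name in ('.', '..') or '\\' in name:
        raise ValueError('cannot derive a filename from %r' % url)
    return name
```

For the URLs in the question, this would yield names such as news-commentary-v10.en.gz. Distinct URLs can still map to the same name, so a real script may want to check for collisions before writing.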