Speeding up urllib.urlretrieve
Question
I am downloading pictures from the internet, and it turns out I need to download a lot of them. I am using a version of the following code fragment (actually looping through the links I intend to download, saving each picture):
import urllib
urllib.urlretrieve(link, filename)
I am downloading roughly 1000 pictures every 15 minutes, which is awfully slow given the number of pictures I need to download.
For efficiency, I set a 5-second timeout (many downloads still take much longer):
import socket
socket.setdefaulttimeout(5)
Besides running a job on a computer cluster to parallelize the downloads, is there a way to make the picture downloads faster or more efficient?
Answer
My code above was very naive, as it did not take advantage of multi-threading. URL requests obviously take time to be answered, but there is no reason the computer cannot issue further requests while waiting for the server to respond.
With the following adjustments you can improve efficiency by roughly 10x, and there are further gains available with packages such as scrapy.
To add multi-threading, do something like the following, using the multiprocessing package:
1) Encapsulate the URL retrieval in a function:
import urllib.request

def geturl(link, i):
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        # swallow failures so one bad link does not stop the batch
        pass
2) Then create a collection with all the URLs, as well as the names you want for the downloaded pictures:
urls = [url1, url2, url3, urln]
names = list(range(len(urls)))
3) Import the Pool class from the multiprocessing package and create an object from it (in a real program you would, of course, put all imports at the top of your code):
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(100)
Then use the pool.starmap() method, passing the function and its arguments:
results = pool.starmap(geturl, zip(urls, names))
Note: pool.starmap() is available only in Python 3.
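Putting the steps above together, a minimal end-to-end sketch might look like the following (the URLs in `urls` and the `download_all` helper are placeholders for illustration; substitute your real links and tune the worker count):

```python
import socket
import urllib.request
from multiprocessing.dummy import Pool as ThreadPool

socket.setdefaulttimeout(5)  # give up on slow responses, as in the question

def geturl(link, i):
    """Fetch one image, saving it as '<i>.jpg'.

    Failures are swallowed so one bad link does not stop the batch;
    returns the index on success and None on failure.
    """
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
        return i
    except Exception:
        return None

def download_all(urls, workers=100):
    # multiprocessing.dummy backs the Pool with threads, which suits
    # I/O-bound work like downloads (no process start-up overhead)
    names = range(len(urls))
    with ThreadPool(workers) as pool:
        return pool.starmap(geturl, zip(urls, names))

if __name__ == "__main__":
    urls = ["http://example.com/pic1.jpg", "http://example.com/pic2.jpg"]  # placeholders
    results = download_all(urls, workers=10)
```

Indices of successful downloads come back in `results`; failed links show up as None, which makes it easy to collect and retry them in a second pass.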