Speeding up urllib.urlretrieve


Question

I am downloading pictures from the internet, and as it turns out, I need to download lots of them. I am using a version of the following code fragment (actually looping through the links I intend to download and downloading the pictures):

import urllib
urllib.urlretrieve(link, filename)

I am downloading roughly 1000 pictures every 15 minutes, which is awfully slow given the number of pictures I need to download.

For efficiency, I set a timeout of 5 seconds (even so, many downloads take much longer):

import socket
socket.setdefaulttimeout(5)
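
If Python 3's urllib.request is available (as in the answer below), a per-request timeout can also be passed to urlopen rather than changing the process-wide socket default; a minimal sketch, where the helper name fetch and the 5-second default are just for illustration:

import urllib.request

def fetch(link, filename, timeout=5):
    # the timeout applies to this request only, instead of changing the
    # global socket default for the whole process
    with urllib.request.urlopen(link, timeout=timeout) as response:
        with open(filename, "wb") as out:
            out.write(response.read())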

Besides running a job on a computer cluster to parallelize downloads, is there a way to make the picture downloads faster / more efficient?

Answer

My code above was very naive, as it did not take advantage of multi-threading. URL requests obviously take time to be answered, but there is no reason the computer cannot issue further requests while it waits for the proxy server to respond.

With the following adjustments, you can improve efficiency by 10x, and there are further ways to improve efficiency with packages such as scrapy.

To add multi-threading, do something like the following, using the multiprocessing package:

1) encapsulate the URL retrieval in a function:

import urllib.request

def geturl(link, i):
    try:
        # save each image under its index, e.g. "0.jpg", "1.jpg", ...
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass

2) then create a collection with all the URLs, as well as the names you want for the downloaded pictures:

urls = [url1, url2, url3, urln]
names = [i for i in range(len(urls))]

3) import the Pool class from the multiprocessing package and create an object of that class (obviously, in a real program you would put all imports at the top of your code):

from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(100)

Then use the pool.starmap() method, passing the function and its arguments:

results = pool.starmap(geturl, zip(urls, names))

Note: pool.starmap() works only in Python 3.
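
Putting the steps together, here is a minimal end-to-end sketch; the placeholder URLs and the pool size of 100 are assumptions, so substitute your own link list and tune the thread count:

from multiprocessing.dummy import Pool as ThreadPool
import urllib.request

def geturl(link, i):
    # download one image, named by its index; ignore failures so a single
    # bad link does not abort the whole batch
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass

if __name__ == "__main__":
    # placeholder URLs -- replace with the links you actually want to fetch
    urls = ["http://example.com/a.jpg", "http://example.com/b.jpg"]
    names = range(len(urls))

    pool = ThreadPool(100)  # 100 worker threads
    pool.starmap(geturl, zip(urls, names))
    pool.close()
    pool.join()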

