Parallel WGET download in bash script

Question

I have this small script to download images from a given list in a file.

FILE=./img-url.txt
while read -r line; do
    url=$line
    # -N: re-download only if the remote file is newer; -P: download directory
    wget -N -P /images/ "$url"
    wget -N -P /images/ "${url%.jpg}"_{001..005}.jpg
done < "$FILE"

The problem is that it takes too long to run (the file has more than 5000 lines). Is there any way to speed things up, like splitting the source txt into separate files and running multiple wget instances at the same time?

Answer

There are a number of ways to go about this. GNU Parallel would be the most general solution, but given how you posed your question, yes, split the file into parts and run the script on each part simultaneously. How many pieces to split the file into is an interesting question. 100 pieces would mean spawning 100 wget processes simultaneously. Almost all of those will sit idle while a very few utilize all the network bandwidth. One process might utilize all the bandwidth for an hour for all I know, but I'm going to guess a good compromise is to split the file into four files, so 4 wget processes run simultaneously. I'm going to call your script geturls.sh. Type this at the command line.

split -n l/4 img-url.txt    # -n l/4: four pieces, lines kept whole (GNU split)
for f in xaa xab xac xad; do
    ./geturls.sh "$f" &
done
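
If your split does not support the GNU -n option, a fallback with the same effect (a sketch, not part of the original answer) is to compute the per-piece line count yourself:

lines=$(wc -l < img-url.txt)
split -l $(( (lines + 3) / 4 )) img-url.txt    # ceil(lines/4) lines per piece gives four pieces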

The split command breaks your file into four roughly even pieces. By default its output files get bland names, in this case xaa, xab, and so on. The for loop takes the name of each piece and passes it to geturls.sh as a command-line argument, the first thing on the command line after the program name. Each geturls.sh is put into the background (&) so the next iteration of the loop can start immediately. In this way geturls.sh runs on all four pieces of the file virtually simultaneously, so you have 4 wget processes going at the same time.
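
If the calling shell should also block until all four pieces have finished downloading (say, before a cleanup step), a wait after the loop does that; this is an optional sketch, not part of the original answer:

for f in xaa xab xac xad; do
    ./geturls.sh "$f" &
done
wait    # returns only after every backgrounded geturls.sh has exited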

The contents of geturls.sh are:

#!/bin/bash
FILE=$1
while read -r line; do
    url=$line
    # -N: re-download only if the remote file is newer; -P: download directory
    wget -N -P /images/ "$url"
    wget -N -P /images/ "${url%.jpg}"_{001..005}.jpg
done < "$FILE"

The only changes I made to your code are the explicit declaration of the shell (out of habit mostly) and that FILE is now assigned the value of $1. Recall that $1 is the first command-line argument, which here is the name of one of the pieces of your img-url.txt file.
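
For completeness, the GNU Parallel route mentioned at the start avoids the manual split entirely. A minimal sketch of that approach (not taken from the original answer; it assumes GNU Parallel is installed): wrap the per-URL work in a bash function, export it, and let parallel keep four jobs running at a time.

getone() {
    # Same downloads as geturls.sh, but for a single URL passed as $1
    url=$1
    wget -N -P /images/ "$url"
    wget -N -P /images/ "${url%.jpg}"_{001..005}.jpg
}
export -f getone                        # exported functions are visible to the shells parallel spawns
parallel -j 4 getone :::: img-url.txt   # -j 4: at most 4 jobs at once; :::: reads one URL per line from the file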
