Parallel wget in Bash


Question


I am getting a bunch of relatively small pages from a website and was wondering if I could somehow do it in parallel in Bash. Currently my code looks like this, but it takes a while to execute (I think what is slowing me down is the latency in the connection).

for i in {1..42}
do
    wget "https://www.example.com/page$i.html"
done


I have heard of using xargs, but I don't know anything about that and the man page is very confusing. Any ideas? Is it even possible to do this in parallel? Is there another way I could go about attacking this?

Answer


Much preferable to pushing wget into the background using & or -b: you can use xargs to the same effect, and better.


The advantage is that xargs will synchronize properly with no extra work. This means that you can safely access the downloaded files (assuming no error occurs). All downloads will have completed (or failed) once xargs exits, and you know by the exit code whether all went well. This is much preferable to busy-waiting with sleep and testing for completion manually.
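For contrast, the background-job approach would look something like this minimal sketch; note that a plain wait returns 0 regardless of whether individual downloads failed, which is exactly the synchronization problem xargs avoids:

for i in {1..42}
do
    wget -q "https://www.example.com/page$i.html" &   # fire off each download in the background
done
wait   # blocks until all background jobs finish, but returns 0 even if some of them failed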


Assuming that URL_LIST is a variable containing all the URLs (it can be constructed with a loop as in the OP's example, but could also be a manually generated list), running this:

echo $URL_LIST | xargs -n 1 -P 8 wget -q


will pass one argument at a time (-n 1) to wget, and execute at most 8 parallel wget processes at a time (-P 8). xargs returns after the last spawned process has finished, which is just what we wanted to know. No extra trickery needed.
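Putting the pieces together, a minimal end-to-end sketch might look like this (the page$i.html pattern follows the question; GNU xargs exits non-zero, typically 123, if any wget invocation failed):

# build URL_LIST with a loop, as suggested above
URL_LIST=""
for i in {1..42}
do
    URL_LIST="$URL_LIST https://www.example.com/page$i.html"
done

# hand one URL at a time to wget, running at most 8 in parallel
echo $URL_LIST | xargs -n 1 -P 8 wget -q

# a non-zero status from xargs means at least one download failed
if [ $? -ne 0 ]; then
    echo "at least one download failed" >&2
fi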


The "magic number" of 8 parallel downloads that I've chosen is not set in stone, but it is probably a good compromise. There are two factors in "maximising" a series of downloads:


One is filling "the cable", i.e. utilizing the available bandwidth. Assuming "normal" conditions (server has more bandwidth than client), this is already the case with one or at most two downloads. Throwing more connections at the problem will only result in packets being dropped and TCP congestion control kicking in, and N downloads with asymptotically 1/N bandwidth each, to the same net effect (minus the dropped packets, minus window size recovery). Packets being dropped is a normal thing to happen in an IP network; this is how congestion control is supposed to work (even with a single connection), and normally the impact is practically zero. However, having an unreasonably large number of connections amplifies this effect, so it can become noticeable. In any case, it doesn't make anything faster.


The second factor is connection establishment and request processing. Here, having a few extra connections in flight really helps. The problem one faces is the latency of two round-trips (typically 20-40ms within the same geographic area, 200-300ms inter-continental) plus the odd 1-2 milliseconds that the server actually needs to process the request and push a reply to the socket. This is not a lot of time per se, but multiplied by a few hundred/thousand requests, it quickly adds up.
Having anything from half a dozen to a dozen requests in-flight hides most or all of this latency (it is still there, but since it overlaps, it does not sum up!). At the same time, having only a few concurrent connections does not have adverse effects, such as causing excessive congestion, or forcing a server into forking new processes.
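As a rough back-of-the-envelope illustration: at, say, 60 ms of round-trip overhead per request, 1,000 sequential requests spend about 60 seconds doing nothing but waiting; with 8 requests in flight, that waiting overlaps and shrinks to roughly 60/8 ≈ 7.5 seconds.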

