WGET - Simultaneous connections are SLOW

Question

I use the following command to append the browser's response from a list of URLs into a corresponding output file:

wget -i /Applications/MAMP/htdocs/data/urls.txt -O - \
     >> /Applications/MAMP/htdocs/data/export.txt

This works fine and when finished it says:

Total wall clock time: 1h 49m 32s
Downloaded: 9999 files, 3.5M in 0.3s (28.5 MB/s)

In order to speed this up I used:

cat /Applications/MAMP/htdocs/data/urls.txt | \
   tr -d '\r' | \
   xargs -P 10 $(which wget) -i - -O - \
   >> /Applications/MAMP/htdocs/data/export.txt

This opens simultaneous connections, making it a little faster:

Total wall clock time: 1h 40m 10s
Downloaded: 3943 files, 8.5M in 0.3s (28.5 MB/s)

As you can see, it somehow omits more than half of the files and takes approximately the same time to finish. I cannot guess why. What I want to do here is download 10 files at once (parallel processing) using xargs, and jump to the next URL when the STDOUT is finished. Am I missing something, or can this be done another way?

On the other hand, can someone tell me what limit can be set regarding the number of connections? It would really help to know how many connections my processor can handle without slowing down my system too much, and to avoid some type of SYSTEM FAILURE.

My API rate limits are the following:

Number of requests per minute: 100
Number of mapping jobs in a single request: 100
Total number of mapping jobs per minute: 10,000
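
For reference, those quotas allow roughly 1.6 requests per second, so even a single throttled stream can keep up with them. A minimal throttling sketch (the paths are taken from the commands above; the 0.6-second delay is simply 60s divided by the 100-requests-per-minute cap):

 while IFS= read -r url; do
   # one request at a time, appended to the same export file as above
   wget -qO - "$url" >> /Applications/MAMP/htdocs/data/export.txt
   sleep 0.6   # 60s / 100 requests, stays under the per-minute cap
 done < /Applications/MAMP/htdocs/data/urls.txt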

Answer

A few things:

  • I don't think you need the tr, unless there's something weird about your input file. xargs expects one item per line.
  • man xargs advises you to "Use the -n option with -P; otherwise chances are that only one exec will be done."
  • You are using wget -i - telling wget to read URLs from stdin. But xargs will be supplying the URLs as parameters to wget.
  • To debug, substitute echo for wget and check how it's batching the parameters (see the sketch after this list).
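
For instance (a sketch; urls.txt stands in for the question's list file):

 cat urls.txt | xargs --max-procs=10 --max-args=100 echo

Each output line is one batch of up to 100 URLs that would have gone to a single wget invocation.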

So this should work:

 cat urls.txt | \
 xargs --max-procs=10 --max-args=100 wget --output-document=- 

(I've preferred the long parameter names: --max-procs is -P, and --max-args is -n.)
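
Applied to the paths from the question, with the append redirect, that becomes:

 cat /Applications/MAMP/htdocs/data/urls.txt | \
 xargs --max-procs=10 --max-args=100 wget --output-document=- \
 >> /Applications/MAMP/htdocs/data/export.txt

Two caveats: the long option names require GNU xargs (the BSD xargs that ships with macOS only accepts the short -P and -n forms), and with ten wget processes sharing one output stream, bytes from different responses can interleave.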

See wget download with multiple simultaneous connections for alternative ways of doing the same thing, including GNU parallel and some dedicated multi-threading HTTP clients.
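
For example, a minimal GNU parallel sketch (assuming GNU parallel is installed, and using the same shorthand paths as the command above; by default it buffers each job's output, so responses do not interleave):

 cat urls.txt | parallel -j 10 wget -qO - {} >> export.txt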

However, in most circumstances I would not expect parallelising to significantly increase your download rate.

In a typical use case, the bottleneck is likely to be your network link to the server. During a single-threaded download, you would expect to saturate the slowest link in that route. You may get very slight gains with two threads, because one thread can be downloading while the other is sending requests. But this will be a marginal gain.

So this approach is only likely to be worthwhile if you're fetching from multiple servers, and the slowest link in the route to some servers is not at the client end.
