Perl too slow concurrent download with both HTTP::Async & Net::Async::HTTP
Problem description
I'm trying to GET about seven dozen URLs in parallel with two scripts: the first is below, using HTTP::Async, and the second is on pastebin, using Net::Async::HTTP.
The problem is that I get nearly the same timing from both: about 8 to 14 seconds for the whole URL list. That is unacceptably slow compared to curl + xargs started from the shell, which fetches everything in less than 3 seconds with 10-20 "threads".
For example, Devel::Timer in the first script shows that the maximum queue length is less than 6 ($queue->in_progress_count <= 5, and $queue->to_send_count is always 0). So it looks like the foreach loop with $queue->add executes too slowly, and I don't know why.
I see much the same with Net::Async::HTTP (the second script on pastebin), which is even slower than the first.
So, does anybody know what I'm doing wrong? How can I get concurrent download speed at least comparable to curl + xargs started from the shell?
    #!/usr/bin/perl -w
    use utf8;
    use strict;
    use POSIX qw(ceil);
    use XML::Simple;
    use Data::Dumper;
    use HTTP::Request;
    use HTTP::Async;
    use Time::HiRes qw(usleep time);
    use Devel::Timer;

    #settings
    use constant passwd  => 'ultramegahypapassword';
    use constant agent   => 'supa agent dev.alpha';
    use constant timeout => 10;
    use constant slots   => 10;
    use constant debug   => 1;

    my @qids;
    my @xmlz;
    my $queue = HTTP::Async->new(slots => slots,max_request_time => 10, timeout => timeout, poll_interval => 0.0001);
    my %responses;
    my @urlz = (
        'http://testpodarki.afghanet/api/products/4577',
        'http://testpodarki.afghanet/api/products/4653',
        'http://testpodarki.afghanet/api/products/4652',
        'http://testpodarki.afghanet/api/products/4571',
        'http://testpodarki.afghanet/api/products/4572',
        'http://testpodarki.afghanet/api/products/4666',
        'http://testpodarki.afghanet/api/products/4576',
        'http://testpodarki.afghanet/api/products/4574',
        'http://testpodarki.afghanet/api/products/4651',
        'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[3294]',
        'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[3294]',
        'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4577]',
        'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4577]',
        'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4577]',
        'http://testpodarki.afghanet/api/product_option_values/188',
        'http://testpodarki.afghanet/api/product_option_values/191',
        'http://testpodarki.afghanet/api/product_option_values/187',
        'http://testpodarki.afghanet/api/product_option_values/190',
        'http://testpodarki.afghanet/api/product_option_values/189',
        'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4653]',
        'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4653]',
        'http://testpodarki.afghanet/api/images/products/4577/12176',
        'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4652]',
        'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4652]',
        'http://testpodarki.afghanet/api/images/products/4653/12390',
        'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4571]',
        'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4571]',
        'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4571]',
        'http://testpodarki.afghanet/api/images/products/4652/12388',
        'http://testpodarki.afghanet/api/product_option_values/175',
        'http://testpodarki.afghanet/api/product_option_values/178',
        'http://testpodarki.afghanet/api/product_option_values/179',
        'http://testpodarki.afghanet/api/product_option_values/180',
        'http://testpodarki.afghanet/api/product_option_values/181',
        'http://testpodarki.afghanet/api/images/products/3294/8965',
        'http://testpodarki.afghanet/api/product_option_values/176',
        'http://testpodarki.afghanet/api/product_option_values/177',
        'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4572]',
        'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4572]',
        'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4572]',
        'http://testpodarki.afghanet/api/product_option_values/176',
        'http://testpodarki.afghanet/api/product_option_values/181',
        'http://testpodarki.afghanet/api/product_option_values/180',
        'http://testpodarki.afghanet/api/images/products/4571/12159',
        'http://testpodarki.afghanet/api/product_option_values/177',
        'http://testpodarki.afghanet/api/product_option_values/179',
        'http://testpodarki.afghanet/api/product_option_values/175',
        'http://testpodarki.afghanet/api/product_option_values/178',
        'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4666]',
        'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4576]',
        'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4666]',
        'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4576]',
        'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4576]',
        'http://testpodarki.afghanet/api/images/products/4572/12168',
        'http://testpodarki.afghanet/api/product_option_values/185',
        'http://testpodarki.afghanet/api/product_option_values/182',
        'http://testpodarki.afghanet/api/product_option_values/184',
        'http://testpodarki.afghanet/api/product_option_values/183',
        'http://testpodarki.afghanet/api/product_option_values/186',
        'http://testpodarki.afghanet/api/images/products/4666/12413',
        'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4574]',
        'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4574]',
        'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4574]',
        'http://testpodarki.afghanet/api/product_option_values/177',
        'http://testpodarki.afghanet/api/product_option_values/181',
        'http://testpodarki.afghanet/api/images/products/4576/12174',
        'http://testpodarki.afghanet/api/product_option_values/176',
        'http://testpodarki.afghanet/api/product_option_values/180',
        'http://testpodarki.afghanet/api/product_option_values/179',
        'http://testpodarki.afghanet/api/product_option_values/175',
        'http://testpodarki.afghanet/api/product_option_values/178',
        'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4651]',
        'http://testpodarki.afghanet/api/images/products/4574/12171',
        'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4651]',
        'http://testpodarki.afghanet/api/images/products/4651/12387'
    );

    my $timer = Devel::Timer->new();

    foreach my $el (@urlz) {
        my $request = HTTP::Request->new(GET => $el);
        $request->header(User_Agent => agent);
        $request->authorization_basic(passwd,'');
        push @qids,$queue->add($request);
        $timer->mark("pushed [$el], to_send=".$queue->to_send_count().", to_return=".$queue->to_return_count().", in_progress=".$queue->in_progress_count());
    }
    $timer->mark('requestz pushed');

    while ($queue->in_progress_count) {
        usleep(2000);
        $queue->poke();
    }
    $timer->mark('requestz complited');

    process_responses();
    $timer->mark('responzez processed');

    foreach my $q (@xmlz) {
        # print ">>>>>>".Dumper($q)."<<<<<<<<\n";
    }

    $timer->report();
    print "\n\n";
Solution
Updated with my experiments using the posted approach
My best results with HTTP::Async are well over 4 and up to over 5 seconds. As I understand it, this approach isn't required, so here is a simple forking example that takes a little over 2 and at most just under 3 seconds.
It uses Parallel::ForkManager and LWP::UserAgent for downloads.
    use warnings;
    use strict;
    use Path::Tiny;
    use LWP::UserAgent;
    use Parallel::ForkManager;

    my @urls = @{ get_urls('https://pastebin.com/raw/VyhMEB3w') };

    my $pm = Parallel::ForkManager->new(60);  # max of 60 processes at a time
    my $ua = LWP::UserAgent->new;
    print "Downloading ", scalar @urls, " files.\n";

    my $dir = 'downloaded_files/';
    mkdir $dir if not -d $dir;

    my $cnt = 0;
    foreach my $link (@urls)
    {
        my $file = "$dir/file_" . ++$cnt . '.txt';

        $pm->start and next;                  # child process
        # add code needed for actual pages (authorization etc)
        my $response = $ua->get($link);
        if ($response->is_success) {
            path($file)->spew_utf8($response->decoded_content);
        }
        else { warn $response->status_line }
        $pm->finish;                          # child exit
    }
    $pm->wait_all_children;

    sub get_urls {
        my $resp = LWP::UserAgent->new->get($_[0]);
        return [ grep /^http:/, split /\s*'?,?\s*\n\s*'?/, $resp->decoded_content ];
    };
The files are written using Path::Tiny. Its path builds an object, and the spew routines write the file.
For reference, the sequential downloads take around 26 seconds.
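The get_urls helper in the script packs its parsing into a single split and grep, which is easy to misread. The sketch below applies the same two expressions to a string so the behaviour can be checked offline. The name parse_urls and the sample text are made up for illustration; the sample assumes the pasted text looks like a Perl list, since the pastebin content itself is not shown in the post.

```perl
use strict;
use warnings;

# Same split/grep as get_urls in the answer, applied to a plain string
# instead of an HTTP response body. parse_urls is a hypothetical name.
sub parse_urls {
    my ($text) = @_;
    return [ grep { /^http:/ } split /\s*'?,?\s*\n\s*'?/, $text ];
}

# Made-up sample in the shape of a pasted Perl list (an assumption about
# what the pastebin holds).
my $sample = <<'END';
my @urlz = (
    'http://example.com/api/products/1',
    'http://example.com/api/products/2',
    'https://example.com/api/products/3',
    'http://example.com/api/products/4'
);
END

# The split removes the quote/comma/newline runs between entries; the grep
# then keeps only the fields that start with "http:".
print "$_\n" for @{ parse_urls($sample) };
```

Two things follow from the expressions as written: an https link does not match /^http:/ and is skipped, and an entry on the very first line of the text would keep its leading quote and be filtered out as well. If either matters for a real list, the pattern would need adjusting.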
With the maximum number of processes set to 30 this takes over 4 seconds, and with 60 it is a little over 2 seconds, about the same as with (up to) 90. There are 70 urls in this test.
Tested on a 4-core laptop with a decent network connection. (Here the CPU isn't all that important.) The tests were run repeatedly, at multiple times and on multiple days.
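The effect of the process cap can be reproduced without any server. The following is a minimal sketch using only core Perl, not the code used for the timings above: run_forked is a hypothetical helper (it is not part of Parallel::ForkManager), and each "download" is simulated by a 0.2 second sleep.

```perl
use strict;
use warnings;
use POSIX qw(_exit);
use Time::HiRes qw(time sleep);

# Hypothetical helper: run each job (a code ref) in its own child process,
# keeping at most $max_procs children alive at any moment.
# Returns the wall-clock seconds taken by the whole batch.
sub run_forked {
    my ($max_procs, @jobs) = @_;
    my $running = 0;
    my $started = time;

    for my $job (@jobs) {
        if ($running >= $max_procs) {   # pool is full: wait for one child to exit
            wait();
            $running--;
        }
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                # child: do the work, then exit at once
            $job->();
            _exit(0);
        }
        $running++;
    }
    1 while wait() != -1;               # reap the remaining children
    return time - $started;
}

# Six simulated "downloads"; sleep stands in for network latency.
my @jobs       = (sub { sleep 0.2 }) x 6;
my $sequential = run_forked(1, @jobs);
my $parallel   = run_forked(6, @jobs);
printf "1 process: %.2fs, 6 processes: %.2fs\n", $sequential, $parallel;
```

With a cap of 1 the jobs run back to back, so the total is roughly the sum of the individual delays; with a cap at or above the job count they overlap and the total approaches the longest single delay. Real downloads add server and bandwidth limits that this simulation does not model.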
A comparison with the approach from the question
The best HTTP::Async results are slower than the above by around a factor of two. They are with 30-40 "slots", since for higher numbers the time goes up, which puzzles me. The module uses select to multiplex, via Net::HTTP::NB (a non-blocking version of Net::HTTP). While select "does not scale well", this regards hundreds of sockets, and I'd expect to be able to use more than 40 on this network-bound problem. The simple forked approach does.
Also, select is considered to be a slow method to monitor sockets, while forks don't even need that, as each process has its own URL. (This may result in the module's overhead with many connections?) Fork's inherent overhead is fixed and dwarfed by network access. If we were after (many) hundreds of downloads the system may get strained by processes, but select wouldn't fare well either.
Finally, select based methods download strictly one file at a time, and the effect is seen by printing as requests are added: we can see the delay. The forked downloads go in parallel (in this case all 70 at the same time without a problem). Then there'll be a network or disk bottleneck, but that is tiny in comparison to the gain.
Update: I pushed this to double the number of sites and processes, saw no signs of OS/CPU strain, and retained the average speed.
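For readers who have not used it, here is a minimal sketch of what select-style multiplexing looks like when a single process watches several handles. It uses the core IO::Select module and pipes to short-lived child processes that stand in for slow servers. It is not how HTTP::Async is implemented internally, and multiplex_reads is a hypothetical name.

```perl
use strict;
use warnings;
use IO::Select;
use POSIX qw(_exit);

# Hypothetical helper: start $n children that each write one line to a pipe
# after a short delay, then collect all lines in the parent via select().
sub multiplex_reads {
    my ($n) = @_;
    my $sel = IO::Select->new;
    my @pids;

    for my $id (1 .. $n) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                              # child: fake server
            close $reader;
            select(undef, undef, undef, 0.05 * $id);  # sub-second sleep
            print {$writer} "reply $id\n";
            close $writer;                            # flushes the line
            _exit(0);
        }
        close $writer;                 # parent keeps only the read end
        $sel->add($reader);
        push @pids, $pid;
    }

    my @lines;
    while ($sel->count) {
        # can_read blocks until at least one handle has data or hit EOF
        for my $fh ($sel->can_read) {
            my $line = <$fh>;
            if (defined $line) { chomp $line; push @lines, $line }
            else               { $sel->remove($fh); close $fh }
        }
    }
    waitpid($_, 0) for @pids;
    return @lines;
}

print "$_\n" for multiplex_reads(3);
```

The parent never blocks on one particular handle; it services whichever handle becomes ready. All the reading still happens in one process, though, which is the contrast with the forked approach, where every download has a process of its own.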
So I'd say, if you need to shave off every second, use forks. But if this is not critical and there are other benefits of HTTP::Async (or such), then be content with (just a bit) longer downloads.
The HTTP::Async code that performs well ended up being simply

    foreach my $link ( @urls ) {
        $async->add( HTTP::Request->new(GET => $link) );
    }
    while ( my $response = $async->wait_for_next_response ) {
        # write file (or process otherwise)
    }
I have also tried to tweak headers and timings. (This included dropping keep-alive as suggested, by $request->header(Connection => 'close'), to no effect.)