Perl too slow concurrent download with both HTTP::Async & Net::Async::HTTP


Question


I'm trying to GET about seven dozen URLs in parallel with two scripts: the first is below, using HTTP::Async, and the second one is on pastebin, using Net::Async::HTTP.

The problem is that I get nearly the same timings with both: about 8 to 14 seconds for the whole list of URLs. That is unacceptably slow compared to curl + xargs started from the shell, which fetches everything in less than 3 seconds with 10-20 "threads".

For example, Devel::Timer in the first script shows that the maximum queue length is less than 6 ($queue->in_progress_count <= 5, and $queue->to_send_count is always 0). So it looks like the foreach loop with $queue->add executes too slowly, and I don't know why. I get much the same with Net::Async::HTTP (the second script on pastebin), which is even slower than the first.

So, does anybody know what I'm doing wrong? How can I get concurrent download speed at least comparable to curl + xargs started from the shell?

#!/usr/bin/perl -w
use utf8;
use strict;
use POSIX qw(ceil);
use XML::Simple;
use Data::Dumper;
use HTTP::Request;
use HTTP::Async;
use Time::HiRes qw(usleep time);
use Devel::Timer;

#settings
use constant passwd => 'ultramegahypapassword';
use constant agent => 'supa agent dev.alpha';
use constant timeout => 10;
use constant slots => 10;
use constant debug => 1;

my @qids;
my @xmlz;
my $queue = HTTP::Async->new(slots => slots,max_request_time => 10, timeout => timeout, poll_interval => 0.0001);
my %responses;
my @urlz = (
'http://testpodarki.afghanet/api/products/4577',
'http://testpodarki.afghanet/api/products/4653',
'http://testpodarki.afghanet/api/products/4652',
'http://testpodarki.afghanet/api/products/4571',
'http://testpodarki.afghanet/api/products/4572',
'http://testpodarki.afghanet/api/products/4666',
'http://testpodarki.afghanet/api/products/4576',
'http://testpodarki.afghanet/api/products/4574',
'http://testpodarki.afghanet/api/products/4651',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[3294]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[3294]',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4577]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4577]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4577]',
'http://testpodarki.afghanet/api/product_option_values/188',
'http://testpodarki.afghanet/api/product_option_values/191',
'http://testpodarki.afghanet/api/product_option_values/187',
'http://testpodarki.afghanet/api/product_option_values/190',
'http://testpodarki.afghanet/api/product_option_values/189',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4653]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4653]',
'http://testpodarki.afghanet/api/images/products/4577/12176',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4652]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4652]',
'http://testpodarki.afghanet/api/images/products/4653/12390',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4571]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4571]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4571]',
'http://testpodarki.afghanet/api/images/products/4652/12388',
'http://testpodarki.afghanet/api/product_option_values/175',
'http://testpodarki.afghanet/api/product_option_values/178',
'http://testpodarki.afghanet/api/product_option_values/179',
'http://testpodarki.afghanet/api/product_option_values/180',
'http://testpodarki.afghanet/api/product_option_values/181',
'http://testpodarki.afghanet/api/images/products/3294/8965',
'http://testpodarki.afghanet/api/product_option_values/176',
'http://testpodarki.afghanet/api/product_option_values/177',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4572]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4572]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4572]',
'http://testpodarki.afghanet/api/product_option_values/176',
'http://testpodarki.afghanet/api/product_option_values/181',
'http://testpodarki.afghanet/api/product_option_values/180',
'http://testpodarki.afghanet/api/images/products/4571/12159',
'http://testpodarki.afghanet/api/product_option_values/177',
'http://testpodarki.afghanet/api/product_option_values/179',
'http://testpodarki.afghanet/api/product_option_values/175',
'http://testpodarki.afghanet/api/product_option_values/178',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4666]',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4576]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4666]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4576]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4576]',
'http://testpodarki.afghanet/api/images/products/4572/12168',
'http://testpodarki.afghanet/api/product_option_values/185',
'http://testpodarki.afghanet/api/product_option_values/182',
'http://testpodarki.afghanet/api/product_option_values/184',
'http://testpodarki.afghanet/api/product_option_values/183',
'http://testpodarki.afghanet/api/product_option_values/186',
'http://testpodarki.afghanet/api/images/products/4666/12413',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4574]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4574]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4574]',
'http://testpodarki.afghanet/api/product_option_values/177',
'http://testpodarki.afghanet/api/product_option_values/181',
'http://testpodarki.afghanet/api/images/products/4576/12174',
'http://testpodarki.afghanet/api/product_option_values/176',
'http://testpodarki.afghanet/api/product_option_values/180',
'http://testpodarki.afghanet/api/product_option_values/179',
'http://testpodarki.afghanet/api/product_option_values/175',
'http://testpodarki.afghanet/api/product_option_values/178',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4651]',
'http://testpodarki.afghanet/api/images/products/4574/12171',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4651]',
'http://testpodarki.afghanet/api/images/products/4651/12387'
);

my $timer = Devel::Timer->new();


foreach my $el (@urlz) {
    my $request = HTTP::Request->new(GET => $el);
    $request->header(User_Agent => agent);
    $request->authorization_basic(passwd,''); 
    push @qids,$queue->add($request);
    $timer->mark("pushed [$el], to_send=".$queue->to_send_count().", to_return=".$queue->to_return_count().", in_progress=".$queue->in_progress_count());
}

$timer->mark('requestz pushed');

while ($queue->in_progress_count) {
    usleep(2000);
    $queue->poke();
}

$timer->mark('requestz complited');

process_responses();


$timer->mark('responzez processed');

foreach my $q (@xmlz) {
#    print ">>>>>>".Dumper($q)."<<<<<<<<\n";
}

$timer->report();
print "\n\n";

Solution

Updated with my experiments on the posted approach


My best results with HTTP::Async are well over 4 seconds, and sometimes over 5. As I understand it, this particular approach isn't a requirement, so here is a simple forking example that takes a little over 2 seconds and at most just under 3.

It uses Parallel::ForkManager and LWP::UserAgent for downloads.

use warnings;
use strict;
use Path::Tiny;    
use LWP::UserAgent;
use Parallel::ForkManager;

my @urls = @{ get_urls('https://pastebin.com/raw/VyhMEB3w') };

my $pm = Parallel::ForkManager->new(60);  # max of 60 processes at a time
my $ua = LWP::UserAgent->new; 
print "Downloading ", scalar @urls, " files.\n";

my $dir = 'downloaded_files';
mkdir $dir if not -d $dir;
my $cnt = 0;   
foreach my $link (@urls) 
{
    my $file = "$dir/file_" . ++$cnt . '.txt';

    $pm->start and next;                        # child process

    # add code needed for actual pages (authorization etc)            
    my $response = $ua->get($link);        
    if ($response->is_success) {
        path($file)->spew_utf8($response->decoded_content);
    }
    else { warn $response->status_line }

    $pm->finish;                                # child exit
}
$pm->wait_all_children;

sub get_urls {
    my $resp = LWP::UserAgent->new->get($_[0]);
    return [ grep /^http:/, split /\s*'?,?\s*\n\s*'?/, $resp->decoded_content ];
};

The files are written using Path::Tiny: its path function builds an object, and the spew family of methods (here spew_utf8) writes the file.

For reference, the sequential downloads take around 26 seconds.

With the maximum number of processes set to 30 this takes over 4 seconds, and with 60 it takes a little over 2 seconds, about the same as with (up to) 90. There are 70 URLs in this test.
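To see how the process limit affects wall-clock time without touching the network, here is a minimal sketch that uses only core Perl. It is not the Parallel::ForkManager code above: fake_download and run_forked are hypothetical helpers, and each "download" is just a sleep of assumed latency (0.2 s, an arbitrary value).

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical stand-in for one HTTP request: just wait for a fixed latency.
sub fake_download {
    my ($latency) = @_;
    sleep $latency;
    return;
}

# Run one "download" per job, with at most $max_procs children alive at once.
# Returns the wall-clock time taken.
sub run_forked {
    my ($max_procs, @latencies) = @_;
    my $start   = time();
    my $running = 0;
    for my $latency (@latencies) {
        if ($running >= $max_procs) {   # pool is full: reap one child first
            wait();
            $running--;
        }
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                # child: do the work and exit
            fake_download($latency);
            exit 0;
        }
        $running++;
    }
    1 while wait() != -1;               # reap the remaining children
    return time() - $start;
}

my @jobs   = (0.2) x 6;                 # six simulated requests of 0.2 s each
my $narrow = run_forked(1, @jobs);      # one child at a time, so sequential
my $wide   = run_forked(6, @jobs);      # all six at once
printf "1 process: %.2f s, 6 processes: %.2f s\n", $narrow, $wide;
```

On an idle machine the second figure should be close to the latency of a single job, though exact numbers will vary. Real downloads add server and bandwidth limits that this sketch ignores.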

Tested on a 4-core laptop with a decent network connection. (Here the CPU isn't all that important.) The tests were run repeatedly, at multiple times and on multiple days.


A comparison with the approach from the question

The best HTTP::Async results are slower than the above by around a factor of two. They are obtained with 30-40 "slots"; with higher numbers the time goes up, which puzzles me. The module uses select to multiplex, via Net::HTTP::NB (a non-blocking version of Net::HTTP). While select "does not scale well", that concerns hundreds of sockets, and I'd expect to be able to use more than 40 on this network-bound problem. The simple forked approach does.
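For readers unfamiliar with select, the following is a rough sketch of the multiplexing idea using the core IO::Select module. It is not how HTTP::Async is implemented internally; the "servers" are hypothetical child processes that write one line to a pipe after an assumed delay, and a single parent process watches all the read ends.

```perl
use strict;
use warnings;
use IO::Select;
use Time::HiRes qw(sleep);

# Hypothetical stand-ins for slow servers: each child writes one line to its
# pipe after a delay, and the parent multiplexes the read ends with select.
my %delay_for = (a => 0.3, b => 0.1, c => 0.2);
my $select    = IO::Select->new;
my %name_of;                            # read handle -> job name

for my $name (sort keys %delay_for) {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                    # child plays the remote server
        close $reader;
        sleep $delay_for{$name};
        print {$writer} "response for $name\n";
        close $writer;
        exit 0;
    }
    close $writer;                      # parent keeps only the read end
    $select->add($reader);
    $name_of{$reader} = $name;
}

my @finished;
while ($select->count) {
    # Block until at least one handle is readable, then service each in turn.
    for my $fh ($select->can_read) {
        my $line = <$fh>;
        push @finished, $name_of{$fh} if defined $line;
        $select->remove($fh);
        close $fh;
    }
}
1 while wait() != -1;                   # reap the children
print "completion order: @finished\n";
```

The parent never blocks on one particular handle; it reads from whichever becomes ready, which is the property a select-based HTTP client relies on. All reads still happen in one process, one handle after another.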

Also, select is considered a slow method for monitoring sockets, while forks don't even need that, as each process has its own URL. (Perhaps this is what shows up as the module's overhead with many connections?) Fork's inherent overhead is fixed and is dwarfed by network access. If we were after (many) hundreds of downloads, the system might get strained by processes, but select wouldn't fare well either.
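The claim that fork's overhead is small next to network latency can be checked on a given machine with a rough measurement like the one below. It is only a sketch: the sample size of 50 is arbitrary, and the child does nothing, so this times a bare fork/exit/wait cycle rather than a real download process.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my $rounds = 50;                        # arbitrary sample size for this sketch
my $start  = time();
for (1 .. $rounds) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    exit 0 if $pid == 0;                # child does nothing and exits
    waitpid $pid, 0;                    # parent reaps it before the next round
}
my $per_fork = (time() - $start) / $rounds;
printf "about %.2f ms per fork/exit/wait cycle\n", $per_fork * 1000;
```

Comparing the printed figure with the per-request time implied by the sequential run (around 26 seconds for 70 URLs) gives a feel for how much of the total the process creation accounts for.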

Finally, select-based methods download strictly one file at a time, and the effect is seen by printing as requests are added: we can see the delay. The forked downloads run in parallel (in this case all 70 at the same time, without a problem). Eventually there will be a network or disk bottleneck, but that is tiny in comparison to the gain.

Update: I pushed this to double the number of sites and processes, saw no signs of OS/CPU strain, and retained the average speed.

So I'd say: if you need to shave off every second, use forks. But if this is not critical and there are other benefits to HTTP::Async (or similar), then be content with (just a bit) longer downloads.


The HTTP::Async code that performs well ended up being simply

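use HTTP::Async;
use HTTP::Request;

my $async = HTTP::Async->new(slots => 40);  # 30-40 slots gave the best times here

foreach my $link ( @urls ) {
    $async->add( HTTP::Request->new(GET => $link) );
}
while ( my $response = $async->wait_for_next_response ) {
    # write file (or process otherwise)
}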

I have also tried tweaking headers and timings. (This included dropping keep-alive, as suggested, with $request->header(Connection => 'close'), to no effect.)
