Perl, how to fetch data from URLs in parallel?

Problem description

I need to fetch some data from many web data providers, who do not expose any service, so I have to write something like this, using for example WWW::Mechanize:

use strict;
use warnings;
use WWW::Mechanize;

my @urls = (
    'http://www.first.data.provider.com',
    'http://www.second.data.provider.com',
    'http://www.third.data.provider.com',
);
my %results;
foreach my $url (@urls) {
    my $mech = WWW::Mechanize->new();
    $mech->get($url);
    $mech->form_number(1);
    $mech->set_fields(user => 'myuser', pass => 'mypass');
    my $resp = $mech->submit();
    $results{$url} = parse($resp->content());
}
consume(%results);

Is there some (possibly simple ;-) way to fetch data into a common %results variable simultaneously, i.e. in parallel, from all the providers?

Answer

threads are to be avoided in Perl. use threads is mostly for emulating UNIX-style fork on Windows; beyond that, it's pointless.

(If you care, the implementation makes this fact very clear. In perl, the interpreter is a PerlInterpreter object. The way threads works is by making a bunch of threads, and then creating a brand-new PerlInterpreter object in each thread. Threads share absolutely nothing, even less than child processes do; fork gets you copy-on-write, but with threads, all the copying is done in Perl space! Slow!)

If you'd like to do many things concurrently in the same process, the way to do that in Perl is with an event loop, like EV, Event, or POE, or by using Coro. (You can also write your code in terms of the AnyEvent API, which will let you use any event loop. This is what I prefer.) The difference between the two is how you write your code.

AnyEvent (and EV, Event, POE, and so on) forces you to write your code in a callback-oriented style. Instead of control flowing from top to bottom, control is in a continuation-passing style. Functions don't return values, they call other functions with their results. This allows you to run many IO operations in parallel -- when a given IO operation has yielded results, your function to handle those results will be called. When another IO operation is complete, that function will be called. And so on.
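
For a feel of that style, here's a minimal sketch using AnyEvent::HTTP (a separate CPAN module; the URLs are placeholders, and the print is a stand-in for your real processing):

use AnyEvent;
use AnyEvent::HTTP;

my @urls = ('http://www.google.com/', 'http://www.jrock.us/');

my $cv = AnyEvent->condvar;          # lets us wait until all callbacks have fired
for my $url (@urls) {
    $cv->begin;                      # count one outstanding request
    http_get $url, sub {
        my ($body, $headers) = @_;   # called whenever THIS response arrives
        print "$url: ", length($body), " bytes\n";
        $cv->end;                    # count one finished request
    };
}
$cv->recv;                           # run the event loop until begin/end balance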

The disadvantage of this approach is that you have to rewrite your code. So there's a module called Coro that gives Perl real (user-space) threads that will let you write your code top-to-bottom, but still be non-blocking. (The disadvantage of this is that it heavily modifies Perl's internals. But it seems to work pretty well.)

So, since we don't want to rewrite WWW::Mechanize tonight, we're going to use Coro. Coro comes with a module called Coro::LWP that will make all calls to LWP be non-blocking. It will block the current thread ("coroutine", in Coro lingo), but it won't block any other threads. That means you can make a ton of requests all at once, and process the results as they become available. And Coro will scale better than your network connection; each coroutine uses just a few k of memory, so it's easy to have tens of thousands of them around.

With that in mind, let's see some code. Here's a program that starts three HTTP requests in parallel, and prints the length of each response. It's similar to what you're doing, minus the actual processing; but you can just put your code in where we calculate the length and it will work the same.

We'll start off with the usual Perl script boilerplate:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';   # the program below uses say

Then we'll load the Coro-specific modules:

use Coro;
use Coro::LWP;
use EV;

Coro uses an event loop behind the scenes; it will pick one for you if you want, but we'll just specify EV explicitly. It's the best event loop.

Then we'll load the modules we need for our work, which is just:

use WWW::Mechanize;

Now we're ready to write our program. First, we need a list of URLs:

my @urls = (
    'http://www.google.com/',
    'http://www.jrock.us/',
    'http://stackoverflow.com/',
);

Then we need a function to spawn a thread and do our work. To make a new thread on Coro, you call async like async { body; of the thread; goes here }. This will create a thread, start it, and continue with the rest of the program.

sub start_thread($) {
    my $url = shift;
    return async {
        say "Starting $url";
        my $mech = WWW::Mechanize->new;
        $mech->get($url);
        printf "Done with $url, %d bytes\n", length $mech->content;
    };
}

So here's the meat of our program. We just put our normal LWP program inside async, and it will be magically non-blocking. get blocks, but the other coroutines will run while waiting for it to get the data from the network.

Now we just need to start the threads:

start_thread $_ for @urls;

And finally, we want to start handling events:

EV::loop;

And that's it. When you run this, you'll see some output like:

Starting http://www.google.com/
Starting http://www.jrock.us/
Starting http://stackoverflow.com/
Done with http://www.jrock.us/, 5456 bytes
Done with http://www.google.com/, 9802 bytes
Done with http://stackoverflow.com/, 194555 bytes

As you can see, the requests are made in parallel, and you didn't have to resort to threads!

Update

You mentioned in your original post that you want to limit the number of HTTP requests that run in parallel. One way to do that is with a semaphore, Coro::Semaphore in Coro.

A semaphore is like a counter. When you want to use the resource that a semaphore protects, you "down" the semaphore. This decrements the counter and continues running your program. But if the counter is at zero when you try to down the semaphore, your thread/coroutine will go to sleep until it is non-zero. When the count goes up again, your thread will wake up, down the semaphore, and continue. Finally, when you're done using the resource that the semaphore protects, you "up" the semaphore and give other threads the chance to run.

This lets you control access to a shared resource, like "making HTTP requests".
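
As a minimal sketch of doing the down/up by hand (do_request is a hypothetical stand-in; the guard form described below is safer):

use Coro;
use Coro::Semaphore;

my $sem = Coro::Semaphore->new(2);   # at most 2 coroutines past the down at once

for my $i (1 .. 10) {
    async {
        $sem->down;        # take a slot, or sleep until one is free
        do_request($i);    # hypothetical use of the protected resource
        $sem->up;          # hand the slot to the next sleeping coroutine
    };
}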

All you need to do is create a semaphore that your HTTP request threads will share:

my $sem = Coro::Semaphore->new(5);

The 5 means "let us call 'down' 5 times before we block", or, in other words, "let there be 5 concurrent HTTP requests".

Before we add any code, let's talk about what can go wrong. Something bad that could happen is a thread "down"-ing the semaphore, but never "up"-ing it when it's done. Then nothing can ever use that resource, and your program will probably end up doing nothing. There are lots of ways this could happen. If you wrote some code like $sem->down; do something; $sem->up, you might feel safe, but what if "do something" throws an exception? Then the semaphore will be left down, and that's bad.

Fortunately, Perl makes it easy to have scope Guard objects, that will automatically run code when the variable holding the object goes out of scope. We can make the code be $sem->up, and then we'll never have to worry about holding a resource when we don't intend to.
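
With the Guard module from CPAN, that looks like this (a sketch; the request code is elided):

use Guard;

$sem->down;
my $guard = guard { $sem->up };   # ups the semaphore when $guard goes out of
                                  # scope, even if the code below dies
# ... make the HTTP request here ...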

Coro::Semaphore integrates the concept of guards, meaning you can say my $guard = $sem->guard, and that will automatically down the semaphore and up it when control flows away from the scope where you called guard.

With that in mind, all we have to do to limit the number of parallel requests is guard the semaphore at the top of our HTTP-using coroutines:

async {
    say "Waiting for semaphore";
    my $guard = $sem->guard;
    say "Starting";
    ...;
    return result;
}

To address the comments:

If you don't want your program to live forever, there are a few options. One is to run the event loop in another thread, and then join on each worker thread. This lets you pass results from the thread to the main program, too:

async { EV::loop };

# start all threads
my @running = map { start_thread $_ } @urls;

# wait for each one to return
my @results = map { $_->join } @running;

for my $result (@results) {
    say $result->[0], ': ', $result->[1];
}

Your threads can return results like:

sub start_thread($) {
    my $url = shift;
    return async {
        ...;
        return [$url, length $mech->content];
    };
}

This is one way to collect all your results in a data structure. If you don't want to return things, remember that all the coroutines share state. So you can put:

my %results;

at the top of your program, and have each coroutine update the results:

async {
    ...;
    $results{$url} = 'whatever';
};

When all the coroutines are done running, your hash will be filled with the results. You'll have to join each coroutine to know when the answer is ready, though.
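
Putting that together with the %results/consume shape from your original post (parse and consume are yours from the question), a sketch:

my %results;

sub start_thread($) {
    my $url = shift;
    return async {
        my $mech = WWW::Mechanize->new;
        $mech->get($url);
        $results{$url} = parse($mech->content);   # shared state is fine here;
                                                  # coroutines only switch at
                                                  # blocking calls
    };
}

async { EV::loop };
my @running = map { start_thread $_ } @urls;
$_->join for @running;    # wait until every slot in %results is filled
consume(%results);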

Finally, if you are doing this as part of a web service, you should use a coroutine-aware web server like Corona. This will run each HTTP request in a coroutine, allowing you to handle multiple HTTP requests in parallel, in addition to being able to send HTTP requests in parallel. This will make very good use of memory, CPU, and network resources, and will be pretty easy to maintain!

(You can basically cut-n-paste our program from above into the coroutine that handles the HTTP request; it's fine to create new coroutines and join inside a coroutine.)
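
For instance, a hypothetical app.psgi along these lines (Corona is a PSGI server; the URLs are placeholders, and this is only a sketch of the shape such an app could take):

# app.psgi -- run with: plackup -s Corona app.psgi
use strict;
use warnings;
use Coro;
use Coro::LWP;          # make the LWP calls inside WWW::Mechanize non-blocking
use WWW::Mechanize;

my @urls = ('http://www.google.com/', 'http://www.jrock.us/');

my $app = sub {
    my $env = shift;

    # Corona runs each client request in its own coroutine, so we can
    # block on join here without stalling other clients.
    my @running = map {
        my $url = $_;
        async {
            my $mech = WWW::Mechanize->new;
            $mech->get($url);
            [$url, length $mech->content];
        };
    } @urls;
    my @results = map { $_->join } @running;

    return [200, ['Content-Type' => 'text/plain'],
            [map { "$_->[0]: $_->[1] bytes\n" } @results]];
};

$app;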
