multi-thread, multi-curl crawler in PHP


Question

Hi everyone once again!

We need some help to develop and implement multi-curl functionality in our crawler. We have a huge array of "links to be scanned" and we loop through them with a foreach.

Let's use some pseudo code to understand the logic:

    1) While ($links_to_be_scanned > 0).
    2) Foreach ($links_to_be_scanned as $link_to_be_scanned).
    3) Scan_the_link() and run some other functions.
    4) Extract the new links from the xdom.
    5) Push the new links into $links_to_be_scanned.
    6) Push the current link into $links_already_scanned.
    7) Remove the current link from $links_to_be_scanned.

Now, we need to define a maximum number of parallel connections and be able to run this process for each link in parallel.
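A minimal sketch of one common way to build such a bounded queue with PHP's built-in curl_multi_* functions (the function name crawl_with_limit, the $max_parallel parameter, and the placement of the per-link work are illustrative assumptions, not existing code):

<?php

// Rolling curl_multi queue: never more than $max_parallel transfers active at once.
function crawl_with_limit(array $links_to_be_scanned, $max_parallel = 10)
{
    $mh = curl_multi_init();
    $active = 0;                          // number of handles currently added
    $links_already_scanned = array();

    while ($links_to_be_scanned || $active > 0) {
        // Top up the pool until the concurrency limit is reached.
        while ($active < $max_parallel && $links_to_be_scanned) {
            $link = array_shift($links_to_be_scanned);
            $ch = curl_init($link);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_PRIVATE, $link); // remember the link on the handle
            curl_multi_add_handle($mh, $ch);
            $active++;
        }

        // Drive the transfers, then wait briefly for socket activity.
        do {
            $status = curl_multi_exec($mh, $running);
        } while ($status === CURLM_CALL_MULTI_PERFORM);
        curl_multi_select($mh, 1.0);

        // Harvest finished transfers and free their slots.
        while ($done = curl_multi_info_read($mh)) {
            $ch   = $done['handle'];
            $link = curl_getinfo($ch, CURLINFO_PRIVATE);
            $body = curl_multi_getcontent($ch);

            // Scan_the_link() equivalent: build the dom/xdom from $body,
            // extract new links and push them onto $links_to_be_scanned here.
            $links_already_scanned[] = $link;

            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            $active--;
        }
    }

    curl_multi_close($mh);
    return $links_already_scanned;
}

Note that in this sketch the per-link work still happens sequentially inside the harvesting loop; only the HTTP transfers themselves overlap.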

I understand that we're going to have to create a $links_being_scanned array or some kind of queue.

I'm really not sure how to approach this problem, to be honest; if anyone could provide a snippet or an idea to solve it, it would be greatly appreciated.

Thanks in advance! Chris;

Extended:

I just realized that the multi-curl itself is not the tricky part; the tricky part is the amount of work done with each link after the request.

Even after the multi-curl, I would eventually have to find a way to run all of these operations in parallel. The whole algorithm described below would have to run in parallel.

So now rethinking, we would have to do something like this:

  While (there are links to be scanned)
  Foreach ($links_to_be_scanned as $link)
  If (there are fewer than 10 scanners running)
  Launch_a_new_scanner($link)
  Remove the link from the $links_to_be_scanned array
  Push the link into the $links_on_queue array
  Endif;

And each scanner does (This should be run in parallel):

  Create an object with the given link
  Send a curl request to the given link
  Create a dom and an Xdom with the response body
  Perform other operations over the response body
  Remove the link from the $links_on_queue array
  Push the link into the $links_already_scanned array

I assume we could approach this by creating a new PHP file with the scanner algorithm and using pcntl_fork() for each parallel process?
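A rough sketch of that forking approach, assuming the pcntl extension is available (CLI only) and using a placeholder scan_link() function for the per-link curl request and dom/xdom work:

<?php

// Fork at most $max_scanners children; each child scans exactly one link.
$max_scanners = 10;
$running = array();                  // child pid => link it is scanning
$links_already_scanned = array();
// $links_to_be_scanned is the existing array of URLs to crawl.

while ($links_to_be_scanned || $running) {
    // Launch new scanners while we are under the limit and links remain.
    while (count($running) < $max_scanners && $links_to_be_scanned) {
        $link = array_shift($links_to_be_scanned);
        $pid  = pcntl_fork();

        if ($pid === -1) {
            die("Could not fork\n");
        } elseif ($pid === 0) {
            // Child process: do the per-link work, then exit.
            scan_link($link);        // placeholder: curl request, dom, xdom, other operations
            exit(0);
        }

        // Parent process: remember which child is scanning which link.
        $running[$pid] = $link;
    }

    // Block until any child exits, then free its slot in the pool.
    $pid = pcntl_wait($status);
    if ($pid > 0) {
        $links_already_scanned[] = $running[$pid];
        unset($running[$pid]);
    }
}

The main caveat with forking is that each child works on a copy of the parent's memory, so links discovered inside a scanner cannot simply be pushed back onto the parent's $links_to_be_scanned; they would have to come back through some shared channel such as a file, a database table, a message queue, or sockets.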

Since even using multi-curl, I would eventually end up waiting for the other processes while looping over a regular foreach structure.

I assume I would have to approach this using fsockopen or pcntl_fork.

Suggestions, comments, partial solutions, and even a "good luck" will be more than appreciated!

Thanks a lot!

Solution

DISCLAIMER: This answer links to an open-source project with which I'm involved. There. You've been warned.

The Artax HTTP client is a socket-based HTTP library that (among other things) offers custom control over the number of concurrent open socket connections to individual hosts while making multiple asynchronous HTTP requests.

Limiting the number of concurrent connections is easily accomplished. Consider:

<?php

use Artax\Client, Artax\Response;

require dirname(__DIR__) . '/autoload.php';

$client = new Client;

// Defaults to max of 8 concurrent connections per host
$client->setOption('maxConnectionsPerHost', 2);

$requests = array(
    'so-home'    => 'http://stackoverflow.com',
    'so-php'     => 'http://stackoverflow.com/questions/tagged/php',
    'so-python'  => 'http://stackoverflow.com/questions/tagged/python',
    'so-http'    => 'http://stackoverflow.com/questions/tagged/http',
    'so-html'    => 'http://stackoverflow.com/questions/tagged/html',
    'so-css'     => 'http://stackoverflow.com/questions/tagged/css',
    'so-js'      => 'http://stackoverflow.com/questions/tagged/javascript'
);

$onResponse = function($requestKey, Response $r) {
    echo $requestKey, ' :: ', $r->getStatus();
};

$onError = function($requestKey, Exception $e) {
    echo $requestKey, ' :: ', $e->getMessage();
};

$client->requestMulti($requests, $onResponse, $onError);

IMPORTANT: In the above example the Client::requestMulti method is making all the specified requests asynchronously. Because the per-host concurrency limit is set to 2, the client will open up new connections for the first two requests and subsequently reuse those same sockets for the other requests, queuing requests until one of the two sockets becomes available.
