Improving HTML scraper efficiency with pcntl_fork()


Question

With the help of two previous questions, I now have a working HTML scraper that feeds product information into a database. What I am now trying to do is improve efficiency by getting my scraper working with pcntl_fork.

If I split my php5-cli script into 10 separate chunks, total runtime improves by a large factor, so I know I am not I/O- or CPU-bound but just limited by the linear nature of my scraping functions.

Using code I've cobbled together from multiple sources, I have this working test:

<?php
libxml_use_internal_errors(true);
ini_set('max_execution_time', 0); 
ini_set('max_input_time', 0); 
set_time_limit(0);

$hrefArray = array("http://slashdot.org", "http://slashdot.org", "http://slashdot.org", "http://slashdot.org");

function doDomStuff($singleHref,$childPid) {
    $html = new DOMDocument();
    $html->loadHtmlFile($singleHref);

    $xPath = new DOMXPath($html);

    $domQuery = '//div[@id="slogan"]/h2';
    $domReturn = $xPath->query($domQuery);

    foreach($domReturn as $return) {
        $slogan = $return->nodeValue;
        echo "Child PID #" . $childPid . " says: " . $slogan . "\n";
    }
}

$pids = array();
foreach ($hrefArray as $singleHref) {
    $pid = pcntl_fork();

    if ($pid == -1) {
        die("Couldn't fork, error!");
    } elseif ($pid > 0) {
        // We are the parent
        $pids[] = $pid;
    } else {
        // We are the child
        $childPid = posix_getpid();
        doDomStuff($singleHref,$childPid);
        exit(0);
    }
}

foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}

// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();

Which raises the following questions:

1) Given that my hrefArray contains 4 URLs: if the array were to contain, say, 1,000 product URLs, would this code spawn 1,000 child processes? If so, what is the best way to limit the number of processes to, say, 10, and, again using 1,000 URLs as an example, split the child workload into 100 products per child (10 x 100)?

2) I've learned that pcntl_fork creates a copy of the process and all variables, classes, etc. What I would like to do is replace my hrefArray variable with a DOMDocument query that builds the list of products to scrape, and then feed them off to child processes to do the processing, so spreading the load across 10 child workers.

My brain is telling me I need to do something like the following (obviously this doesn't work, so don't run it):

<?php
libxml_use_internal_errors(true);
ini_set('max_execution_time', 0); 
ini_set('max_input_time', 0); 
set_time_limit(0);
$maxChildWorkers = 10;

$html = new DOMDocument();
$html->loadHtmlFile('http://xxxx');
$xPath = new DOMXPath($html);

$domQuery = '//div[@id="productDetail"]/a';
$domReturn = $xPath->query($domQuery);

$hrefsArray = array();
foreach ($domReturn as $node) {
    $hrefsArray[] = $node->getAttribute('href');
}

function doDomStuff($singleHref) {
    // Do stuff here with each product
}

// To figure out: Split href array into $maxChildWorkers # of workArray1, workArray2 ... workArray10.
$pids = array();
foreach ($workArray(1,2,3 ... 10) as $singleHref) {
    $pid = pcntl_fork();

    if ($pid == -1) {
        die("Couldn't fork, error!");
    } elseif ($pid > 0) {
        // We are the parent
        $pids[] = $pid;
    } else {
        // We are the child
        $childPid = posix_getpid();
        doDomStuff($singleHref);
        exit(0);
    }
}


foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}

// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();

But what I can't figure out is how to build my hrefsArray[] in the master/parent process only and feed it off to the child processes. Currently everything I've tried causes loops in the child processes, i.e. my hrefsArray gets built in the master and again in each subsequent child process.

I am sure I am going about this all totally wrong, so I would greatly appreciate just a general nudge in the right direction.

Answer

It seems like I suggest this daily, but have you looked at Gearman? There's even a well-documented PECL class.

Gearman is a work queue system. You'd create workers that connect and listen for jobs, and clients that connect and send jobs. The client can either wait for the requested job to be completed, or fire it and forget. At your option, workers can even send back status updates reporting how far through the process they are.

In other words, you get the benefits of multiple processes or threads without having to worry about processes and threads. The clients and workers can even be on different machines.
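For the scraping case above, that might look roughly like the following untested sketch, assuming the PECL gearman extension and a gearmand server on localhost; the function name "scrape" is a made-up example, and in practice the two halves would be separate scripts:

```php
<?php
// --- worker.php: run ~10 copies of this to get 10 parallel scrapers ---
$worker = new GearmanWorker();
$worker->addServer();                       // defaults to 127.0.0.1:4730
$worker->addFunction('scrape', function (GearmanJob $job) {
    $href = $job->workload();               // one product URL per job
    // ... doDomStuff($href) here: load with DOMDocument, save to the DB ...
});
while ($worker->work());                    // block, handling jobs as they arrive

// --- client.php: the parent builds $hrefsArray once, then queues the jobs ---
$client = new GearmanClient();
$client->addServer();
foreach ($hrefsArray as $href) {
    $client->doBackground('scrape', $href); // fire-and-forget, one job per URL
}
```

This also sidesteps the copy-on-fork question entirely: the URL list lives only in the client, and each worker receives just the single URL in its job payload.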
