Improving HTML scraper efficiency with pcntl_fork()


Question

With the help of two previous questions, I now have a working HTML scraper that feeds product information into a database. What I am now trying to do is improve efficiency by getting my scraper working with pcntl_fork.

If I split my php5-cli script into 10 separate chunks, total runtime improves by a large factor, so I know I am not I/O- or CPU-bound but just limited by the linear nature of my scraping functions.

Using code I've cobbled together from multiple sources, I have this working test:

<?php
libxml_use_internal_errors(true);
ini_set('max_execution_time', 0); 
ini_set('max_input_time', 0); 
set_time_limit(0);

$hrefArray = array("http://slashdot.org", "http://slashdot.org", "http://slashdot.org", "http://slashdot.org");

function doDomStuff($singleHref,$childPid) {
    $html = new DOMDocument();
    $html->loadHtmlFile($singleHref);

    $xPath = new DOMXPath($html);

    $domQuery = '//div[@id="slogan"]/h2';
    $domReturn = $xPath->query($domQuery);

    foreach($domReturn as $return) {
        $slogan = $return->nodeValue;
        echo "Child PID #" . $childPid . " says: " . $slogan . "\n";
    }
}

$pids = array();
foreach ($hrefArray as $singleHref) {
    $pid = pcntl_fork();

    if ($pid == -1) {
        die("Couldn't fork, error!");
    } elseif ($pid > 0) {
        // We are the parent
        $pids[] = $pid;
    } else {
        // We are the child
        $childPid = posix_getpid();
        doDomStuff($singleHref,$childPid);
        exit(0);
    }
}

foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}

// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();

Which raises the following questions:

1) Given that my hrefArray contains 4 URLs: if the array were to contain, say, 1,000 product URLs, would this code spawn 1,000 child processes? If so, what is the best way to limit the number of processes to, say, 10, and, again using 1,000 URLs as an example, split the child workload into 100 products per child (10 x 100)?

2) I've learned that pcntl_fork creates a copy of the process and all variables, classes, etc. What I would like to do is replace my hrefArray variable with a DOMDocument query that builds the list of products to scrape, and then feed them off to child processes to do the processing, so spreading the load across 10 child workers.

My brain is telling me I need to do something like the following (obviously this doesn't work, so don't run it):

<?php
libxml_use_internal_errors(true);
ini_set('max_execution_time', 0); 
ini_set('max_input_time', 0); 
set_time_limit(0);
$maxChildWorkers = 10;

$html = new DOMDocument();
$html->loadHtmlFile('http://xxxx');
$xPath = new DOMXPath($html);

$domQuery = '//div[@id="productDetail"]/a';
$domReturn = $xPath->query($domQuery);

$hrefsArray = array();
foreach ($domReturn as $node) {
    $hrefsArray[] = $node->getAttribute('href');
}

function doDomStuff($singleHref) {
    // Do stuff here with each product
}

// To figure out: Split href array into $maxChildWorkers # of workArray1, workArray2 ... workArray10.
$pids = array();
foreach ($workArray(1,2,3 ... 10) as $singleHref) {
    $pid = pcntl_fork();

    if ($pid == -1) {
        die("Couldn't fork, error!");
    } elseif ($pid > 0) {
        // We are the parent
        $pids[] = $pid;
    } else {
        // We are the child
        $childPid = posix_getpid();
        doDomStuff($singleHref);
        exit(0);
    }
}


foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}

// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();

But what I can't figure out is how to build my hrefsArray[] in the master/parent process only and feed it off to the child processes. Currently everything I've tried causes loops in the child processes, i.e. my hrefsArray gets built in the master and again in each subsequent child process.

I am sure I am going about this all totally wrong, so I would greatly appreciate just a general nudge in the right direction.

Answer

It seems like I suggest this daily, but have you looked at Gearman? There's even a well-documented PECL class.

Gearman is a work queue system. You'd create workers that connect and listen for jobs, and clients that connect and send jobs. The client can either wait for the requested job to be completed, or fire it and forget. At your option, workers can even send back status updates reporting how far through the process they are.

In other words, you get the benefits of multiple processes or threads without having to worry about processes and threads. The clients and workers can even be on different machines.
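For the scraping case above, that might look roughly like the following untested sketch, assuming the PECL gearman extension and a gearmand server on localhost; the function name "scrape" is a made-up example, and in practice the two halves would be separate scripts:

```php
<?php
// --- worker.php: run ~10 copies of this to get 10 parallel scrapers ---
$worker = new GearmanWorker();
$worker->addServer();                       // defaults to 127.0.0.1:4730
$worker->addFunction('scrape', function (GearmanJob $job) {
    $href = $job->workload();               // one product URL per job
    // ... doDomStuff($href) here: load with DOMDocument, save to the DB ...
});
while ($worker->work());                    // block, handling jobs as they arrive

// --- client.php: the parent builds $hrefsArray once, then queues the jobs ---
$client = new GearmanClient();
$client->addServer();
foreach ($hrefsArray as $href) {
    $client->doBackground('scrape', $href); // fire-and-forget, one job per URL
}
```

This also sidesteps the copy-on-fork question entirely: the URL list lives only in the client, and each worker receives just the single URL in its job payload.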
