Errors regarding Web Crawler in PHP

Problem Description

I am trying to create a simple web crawler in PHP that is capable of crawling .edu domains, given the seed URLs of the parent domains.

I have used Simple HTML DOM to implement the crawler, while I implemented some of the core logic myself.

I am posting the code below and will try to explain the problems.

private function initiateChildCrawler($parent_Url_Html) {

    global $CFG;
    static $foundLink;
    static $parentID;
    static $urlToCrawl_InstanceOfChildren;

    $forEachCount = 0;
    foreach($parent_Url_Html->getHTML()->find('a') as $foundLink)
    {
        $forEachCount++;
        if($forEachCount < 500)
        {
            $foundLink->href = url_to_absolute($parent_Url_Html->getURL(), $foundLink->href);

            if($this->validateEduDomain($foundLink->href))
            {
                //Implement else condition later on
                $parentID = $this->loadSaveInstance->parentExists_In_URL_DB_CRAWL($this->returnParentDomain($foundLink->href));
                if($parentID != FALSE)
                {
                    if($this->loadSaveInstance->checkUrlDuplication_In_URL_DB_CRAWL($foundLink->href) == FALSE)
                    {
                        $urlToCrawl_InstanceOfChildren = new urlToCrawl($foundLink->href);
                        if($urlToCrawl_InstanceOfChildren->getSimpleDomSource($CFG->finalContext) != FALSE)
                        {
                            $this->loadSaveInstance->url_db_html($urlToCrawl_InstanceOfChildren->getURL(), $urlToCrawl_InstanceOfChildren->getHTML());
                            $this->loadSaveInstance->saveCrawled_To_URL_DB_CRAWL(NULL, $foundLink->href, "crawled", $parentID);

                            /*if($recursiveCount<1)
                            {
                                $this->initiateChildCrawler($urlToCrawl_InstanceOfChildren);
                            }*/
                        }
                    }
                }
            }
        }
    }
}

Now, as you can see, initiateChildCrawler is called by the initiateParentCrawler function, which passes the parent link to the child crawler. Example of a parent link: www.berkeley.edu, for which the crawler will find all the links on its main page and return all of their HTML content. This continues until the seed URLs are exhausted.

For example:

  1. harvard.edu ->>>>> will find all the links and return their HTML content (by calling childCrawler), then move to the next parent in parentCrawler.
  2. berkeley.edu ->>>>> will find all the links and return their HTML content (by calling childCrawler).

The other functions are self-explanatory.

Now the problem: after childCrawler completes the foreach loop over all the links, the function is unable to exit properly. If I run the script from the CLI, the CLI crashes; running the script in the browser causes the script to terminate.

But if I limit the number of crawled child links to 10 or so (by altering the check on $forEachCount), the crawler starts working fine.

Please help me with this.

Message from CLI:

Problem signature:
  Problem Event Name: APPCRASH
  Application Name: php-cgi.exe
  Application Version: 5.3.8.0
  Application Timestamp: 4e537939
  Fault Module Name: php5ts.dll
  Fault Module Version: 5.3.8.0
  Fault Module Timestamp: 4e537a04
  Exception Code: c0000005
  Exception Offset: 0000c793
  OS Version: 6.1.7601.2.1.0.256.48
  Locale ID: 1033
  Additional Information 1: 0a9e
  Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
  Additional Information 3: 0a9e
  Additional Information 4: 0a9e372d3b4ad19135b953a78882e789

Recommended Answer

Flat Loop Example:

  1. You start the loop with a stack that contains all the URLs you want to process first.
  2. Inside the loop:
     1. You shift the first URL from the stack (you obtain it and it is removed).
     2. If you find new URLs, you add them at the end of the stack (push).

This will run until all URLs from the stack have been processed, so you add a counter (as you already have, in a way, with the foreach) to prevent it from running for too long:

$URLStack = (array) $parent_Url_Html->getHTML()->find('a');
$URLProcessedCount = 0;
while ($URLProcessedCount++ < 500) # this can run endless, so this saves us from processing too many URLs
{
    $url = array_shift($URLStack);
    if (!$url) break; # exit if the stack is empty

    # process URL

    # for each new URL:
    $URLStack[] = $newURL;
}
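
The "# process URL" placeholder can be filled with the checks you already have. The following is only a rough sketch, assuming it sits inside initiateChildCrawler (so $this, the global $CFG and loadSaveInstance are available exactly as in your snippet); $child is just a shorter name for your $urlToCrawl_InstanceOfChildren, and the stack holds absolute URL strings rather than the anchor elements themselves:

$URLStack = array();
foreach ($parent_Url_Html->getHTML()->find('a') as $link) {
    # store absolute URL strings in the stack instead of element objects
    $URLStack[] = url_to_absolute($parent_Url_Html->getURL(), $link->href);
}

$URLProcessedCount = 0;
while ($URLProcessedCount++ < 500)
{
    $url = array_shift($URLStack);
    if (!$url) break;                                   # stack is empty, we are done

    if (!$this->validateEduDomain($url)) continue;      # only .edu domains

    $parentID = $this->loadSaveInstance->parentExists_In_URL_DB_CRAWL($this->returnParentDomain($url));
    if ($parentID == FALSE) continue;

    if ($this->loadSaveInstance->checkUrlDuplication_In_URL_DB_CRAWL($url) != FALSE) continue;

    $child = new urlToCrawl($url);
    if ($child->getSimpleDomSource($CFG->finalContext) == FALSE) continue;

    $this->loadSaveInstance->url_db_html($child->getURL(), $child->getHTML());
    $this->loadSaveInstance->saveCrawled_To_URL_DB_CRAWL(NULL, $url, "crawled", $parentID);

    # instead of recursing, push the child's links for a later iteration
    foreach ($child->getHTML()->find('a') as $link) {
        $URLStack[] = url_to_absolute($child->getURL(), $link->href);
    }
}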

You can make it even more intelligent by not adding URLs to the stack that already exist in it; however, you then need to insert only absolute URLs into the stack. I highly suggest you do that, because there is no need to process a page you have already obtained again (e.g. each page probably contains a link to the homepage). If you want to do this, just index the stack with $URLProcessedCount inside the loop instead of shifting entries off, so you keep previous entries as well:

while ($URLProcessedCount < 500) # this can run endless, so this saves us from processing too many URLs
{
    $url = $URLStack[$URLProcessedCount++];
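
Put together, that variant could look roughly like the sketch below. It is only an illustration of the idea, assuming the stack is seeded with absolute URL strings; the line that builds $newURLs is a placeholder you would replace with your own fetching and link extraction:

$URLStack = array('http://www.berkeley.edu/');          # hypothetical seed, already absolute
$URLProcessedCount = 0;

while ($URLProcessedCount < 500)                        # guard against crawling forever
{
    if (!isset($URLStack[$URLProcessedCount])) break;   # nothing left to process
    $url = $URLStack[$URLProcessedCount++];             # entries stay in the stack

    $newURLs = array();   # placeholder: fill with absolute URLs extracted from $url

    foreach ($newURLs as $newURL) {
        # only queue URLs we have not seen yet
        if (!in_array($newURL, $URLStack, true)) {
            $URLStack[] = $newURL;
        }
    }
}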

Additionally, I suggest you use the PHP DOMDocument extension instead of Simple HTML DOM, as it is a much more versatile tool.
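
For reference, a minimal sketch of collecting absolute links from a page with DOMDocument might look like the following. The seed URL is just an example, file_get_contents stands in for whatever fetching you already do, and url_to_absolute is the same helper your own code already uses:

$pageUrl = 'http://www.berkeley.edu/';      # example seed URL
$html = file_get_contents($pageUrl);        # simplistic fetch, no error handling

$doc = new DOMDocument();
libxml_use_internal_errors(true);           # real-world HTML is rarely well-formed
$doc->loadHTML($html);
libxml_clear_errors();

$links = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    if ($href !== '') {
        $links[] = url_to_absolute($pageUrl, $href);   # resolve relative hrefs
    }
}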
