PhantomJS using too many threads


Problem description

I wrote a PhantomJS app to crawl over a site I built and check that a JavaScript file is included. The JavaScript is similar to Google's, where some inline code loads another JS file. The app looks for that other JS file, which is why I used Phantom.

What is the expected result?

The console output should read through a ton of URLs and then tell whether the script is loaded or not.

What actually happened?

The console output reads as expected for about 50 requests and then just starts spitting out this error:

2013-02-21T10:01:23 [FATAL] QEventDispatcherUNIXPrivate(): Can not continue without a thread pipe
QEventDispatcherUNIXPrivate(): Unable to create thread pipe: Too many open files
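The second line of the error points at the process's file-descriptor limit rather than threads as such: each open PhantomJS page holds descriptors (sockets, thread pipes). As a side check, not part of the original question, the limit can be inspected from the shell on Linux/macOS:

```shell
# "Too many open files" means the process hit its file-descriptor limit.
ulimit -n   # prints the current soft limit for this shell, commonly 1024
```

Raising the limit (e.g. `ulimit -n 4096`) only postpones the failure; closing pages is the real fix, as the answer below explains.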

This is the block of code that opens a page and searches for the script include:

page.open(url, function (status) {
  console.log(YELLOW, url, status, CLEAR);
  var found = page.evaluate(function () {
    return document.querySelectorAll("script[src='***']").length > 0;
  });

  if (found) {
    console.log(GREEN, 'JavaScript found on', url, CLEAR);
  } else {
    console.log(RED, 'JavaScript not found on', url, CLEAR);
  }
  self.crawledURLs[url] = true;
  self.crawlURLs(self.getAllLinks(page), depth - 1);
});

The crawledURLs object is just a map of the URLs I've already crawled. The crawlURLs function goes through the links from the getAllLinks function and calls the open function on all links that share the base domain the crawler started on.
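The same-domain, not-yet-crawled filtering described above can be sketched in plain JavaScript. This is a hypothetical illustration, not the question's actual code; `baseDomain` is assumed here, while `crawledURLs` mirrors the crawler's own map:

```javascript
// Hypothetical sketch: keep only links on the starting domain that
// have not been crawled yet (crawledURLs maps url -> true).
function filterCrawlable(links, baseDomain, crawledURLs) {
  return links.filter(function (url) {
    var sameDomain = url.indexOf('http://' + baseDomain) === 0 ||
                     url.indexOf('https://' + baseDomain) === 0;
    return sameDomain && !crawledURLs[url];
  });
}
```

crawlURLs would then only call page.open on the URLs this filter returns.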

Edit

I modified the last block of the code as follows, but I still have the same issue. I have added page.close() to the file.

if (!found) {
  console.log(RED, 'JavaScript not found on', url, CLEAR);
}
self.crawledURLs[url] = true;
var links = self.getAllLinks(page);
page.close();
self.crawlURLs(links, depth-1);


Recommended answer

From the documentation:


Due to some technical limitations, the web page object might not be completely garbage collected. This is often encountered when the same object is used over and over again.

The solution is to explicitly call close() on the web page object (i.e. page in many cases) at the right time.

Some included examples, such as follow.js, demonstrate multiple page objects with explicit close.
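One way to guarantee only a single page is ever open, so descriptors cannot pile up, is to drain the URL queue strictly one at a time. A minimal sketch, assuming a `visit(url, done)` callback that stands in for the PhantomJS-specific work (webpage.create(), page.open(), page.close()):

```javascript
// Sequential crawl loop: the next URL is visited only after the previous
// visit() calls back, i.e. after its page has been explicitly closed.
function crawlSequentially(urls, visit, onFinished) {
  var queue = urls.slice(); // copy, so the caller's array is untouched
  function next() {
    if (queue.length === 0) { onFinished(); return; }
    var url = queue.shift();
    visit(url, next); // visit must call back only after page.close()
  }
  next();
}
```

In the actual crawler, visit would create a page with require('webpage').create(), call page.open(url, ...), collect the links, call page.close(), and then invoke the callback, instead of recursively opening every discovered link at once.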
