PhantomJS crashes when I open too many pages and ignores the last URL


Problem description

System: Windows 8.1 64-bit, using the version 2.0 binary from the main page.

I have a .txt file with one URL per line. I read every line and open the page, searching for a specific url.match (domain changed in the code for privacy reasons). If a match is found, I print the found JSON, abort the request, and unload the page. My .txt file contains 12500 links; for testing purposes I split it into the first 10/100/500 URLs.

Problem 1: If I try 10 URLs, it prints 9 and uses 40-50% CPU afterwards.

Problem 2: If I try 100 URLs, it prints 98, uses 40-50% CPU afterwards for whatever reason, then crashes after 2-3 minutes.

Problem 3: The same goes for 98 links (it prints 96, uses 40-50% CPU, then crashes too) and for 500 links.

TXT-files: https://www.dropbox.com/s/eeiy12ku5k15226/sitemaps.7z?dl=1

Crash dumps for 98, 100 and 500 links: https://www.dropbox.com/s/ilvbg8lv1bizjti/Crash%20dumps.7z?dl=1

console.log('Hello, world!');
var fs = require('fs');
var stream = fs.open('100sitemap.txt', 'r');
var line = stream.readLine();
var webPage = require('webpage');
var i = 1;

while(!stream.atEnd() || line != "") {
     //console.log(line);
    var page = webPage.create();
    page.settings.loadImages = false;
    page.open(line, function() {});
    //console.log("opened " + line);
    page.onResourceRequested = function(requestData, request) {
        //console.log("BEFORE: " +requestData.url);
        var match = requestData.url.match(/example.com\/ac/g)
        //console.log("Match: " + match);
        //console.log("Line: " + line);
        //console.log("Match: " + match);
        if (match != null) {
            var targetString = decodeURI(JSON.stringify(requestData.url));
            var klammerauf = targetString.indexOf("{");
            var jsonobjekt = targetString.substr(klammerauf,   (targetString.indexOf("}") - klammerauf) + 1);
            targetJSON = (decodeURIComponent(jsonobjekt));
            console.log(i);
            i++;
            console.log(targetJSON);
            console.log("");
            request.abort();
            page.close();
        }
    };
    var line = stream.readLine();
}

//console.log("File closed");
//stream.close();

Solution

Concurrent Requests

You really shouldn't load pages in a loop, because a loop is a synchronous construct whereas page.open() is asynchronous. If you do, memory consumption skyrockets, because all URLs are opened at the same time. This becomes a problem with 20 or more URLs in the list.
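To make the difference concrete, here is a minimal sketch that runs in plain JavaScript without PhantomJS. The `makeFakeBrowser` helper and its `open`/`drain`/`peak` methods are hypothetical stand-ins invented for this illustration; `open` only mimics the asynchronous shape of page.open() by deferring its callback. The sketch counts how many "pages" are open at once for a synchronous loop versus a callback chain.

```javascript
// Hypothetical stand-in for an asynchronous page.open(); not PhantomJS API.
function makeFakeBrowser() {
    var pending = [];   // deferred callbacks, as an event loop would hold them
    var inFlight = 0;   // pages opened but not yet finished
    var peak = 0;       // highest number of simultaneously open pages
    return {
        open: function (url, callback) {
            inFlight++;
            peak = Math.max(peak, inFlight);
            pending.push(function () {
                inFlight--;
                callback(url);
            });
        },
        // Fire deferred callbacks until none are left (simulates the event loop).
        drain: function () {
            while (pending.length > 0) {
                pending.shift()();
            }
        },
        peak: function () { return peak; }
    };
}

var urls = ['a', 'b', 'c', 'd', 'e'];

// Variant 1: synchronous loop, as in the question. Every open() is issued
// before a single callback has had the chance to run.
var loopBrowser = makeFakeBrowser();
urls.forEach(function (url) {
    loopBrowser.open(url, function () {});
});
loopBrowser.drain();

// Variant 2: the next URL is opened only from the previous callback.
var chainBrowser = makeFakeBrowser();
function openNext(index) {
    if (index >= urls.length) { return; }
    chainBrowser.open(urls[index], function () {
        openNext(index + 1);
    });
}
openNext(0);
chainBrowser.drain();

console.log('loop peak:', loopBrowser.peak());   // loop peak: 5
console.log('chain peak:', chainBrowser.peak()); // chain peak: 1
```

With the loop, the peak grows with the length of the list; with the chain it stays at one, which is the property the fix further down relies on.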

Function-level scope

The other problem is that JavaScript has function-level scope. That means that even when you define the page variable inside of the while block, it is available globally. Since it is defined globally, you get a problem with the asynchronous nature of PhantomJS. The page inside of the page.onResourceRequested function definition is very likely not the same page that was used to open the URL which triggered the callback. See more on that here. A common solution would be to use an IIFE to bind the page variable to only one iteration, but you need to rethink your whole approach.
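The scoping issue can be reproduced without PhantomJS. In this small sketch the strings in `names` are placeholders standing in for page instances, and the array names are made up for the example. It compares callbacks that close over a `var` declared in the loop body with callbacks wrapped in an IIFE.

```javascript
var names = ['page1', 'page2', 'page3'];  // placeholders for page instances
var sharedCallbacks = [];
var boundCallbacks = [];

for (var k = 0; k < names.length; k++) {
    // `var` is scoped to the enclosing function (or the whole script),
    // so there is only ONE `page` variable for all iterations.
    var page = names[k];
    sharedCallbacks.push(function () { return page; });

    // The IIFE creates a new function scope per iteration, so each
    // callback keeps its own copy of the value.
    (function (page) {
        boundCallbacks.push(function () { return page; });
    })(page);
}

var shared = sharedCallbacks.map(function (fn) { return fn(); });
var bound = boundCallbacks.map(function (fn) { return fn(); });

console.log(shared.join(','));  // page3,page3,page3
console.log(bound.join(','));   // page1,page2,page3
```

By the time any callback runs, the loop has finished, so every callback that shares the single `page` variable sees the value from the last iteration.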

Memory leak

You also have a memory leak, because when the URL in the page.onResourceRequested event doesn't match, you're not aborting the request and not cleaning up the page instance. You probably want to do that for all URLs and not just the ones that match your specific regex.
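One possible way to read this advice is to route every outcome, match or no match, through a single cleanup function that closes the page exactly once. The sketch below is plain JavaScript; `makeStubPage` and `makeCleanup` are hypothetical helpers written for this illustration, and the stub only models a `close()` method. In a real script the cleanup would presumably be called both from the request handler (on a match) and from the page.open() callback (when loading ends without a match).

```javascript
// Hypothetical stub standing in for a PhantomJS page; only close() is modelled.
function makeStubPage() {
    return {
        closed: false,
        close: function () { this.closed = true; }
    };
}

// Returns a cleanup function that closes the page at most once,
// whichever code path reaches it first, then reports the result.
function makeCleanup(page, onDone) {
    var finished = false;
    return function (result) {
        if (finished) { return; }
        finished = true;
        page.close();
        onDone(result);
    };
}

var results = [];
function record(result) { results.push(result); }

// Path 1: a request matched, so cleanup runs from the request handler.
var matchedPage = makeStubPage();
var cleanupMatched = makeCleanup(matchedPage, record);
cleanupMatched('match');
cleanupMatched(null);   // a later load-finished callback is ignored

// Path 2: nothing matched, so cleanup runs from the load-finished callback.
var unmatchedPage = makeStubPage();
var cleanupUnmatched = makeCleanup(unmatchedPage, record);
cleanupUnmatched(null);

console.log(matchedPage.closed, unmatchedPage.closed, results.length);
```

Both pages end up closed, and each URL reports exactly one result, so no page instance is left behind when the regex never matches.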

Easy fix

A fast solution would be to define a function that does one iteration and calls the next iteration when the current one has finished. You can also re-use one page instance for all requests.

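var fs = require('fs');
var webPage = require('webpage');
var stream = fs.open('100sitemap.txt', 'r');  // file name taken from the question
var page = webPage.create();                  // one page instance, re-used for every URL

// Handles exactly one URL; the next one starts only after this one is done.
function runOnce(){
    // Stop when the file is exhausted.
    if (stream.atEnd()) {
        phantom.exit();
        return;
    }
    var url = stream.readLine();
    // An empty line also ends the run.
    if (url === "") {
        phantom.exit();
        return;
    }

    page.open(url, function() {});

    page.onResourceRequested = function(requestData, request) {
        /**...**/

        request.abort();

        // Move on to the next URL.
        runOnce();
    };
}

runOnce();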
