PhantomJS open() too slow

Question

I'm having a problem with web scraping in NodeJS: I want to take some data from a remote webpage, but the data is inserted into the HTML by JavaScript. I started to use PhantomJS and it works great, except for one thing that is preventing me from finishing my job. PhantomJS is working too slowly; this snippet of code needs about 14 seconds to execute!?

var page = require('webpage').create();
page.open('https://www.halooglasi.com/nekretnine/izdavanje-stanova/novi-beograd---novi-merkator-id19270/5425485514649', function () {
    phantom.exit();
});

With the request library, which just returns the raw data, it's much faster: a little more than a second, so PhantomJS is working another 13 seconds or so. It looks like PhantomJS is doing a lot of unnecessary operations which I don't need. I don't need to render pictures, videos or anything; I just need the JavaScript to execute so I can use cheerio to get the data from the HTML. Can you tell me how to speed up PhantomJS, or maybe use some other, faster WebKit browser for my needs?

Recommended answer

There are several measures you can take to decrease processing time.

1. Get a more powerful server/computer (as Mathieu rightly noted).

Yes, you could argue this is irrelevant to the question, but when it comes to scraping it matters a great deal. On a budget $8 VPS, without any optimization, your initial script ran in 9589ms, which is already a ~30% improvement over your 14 seconds.

2. Turn off image loading. It will help... a bit: 8160ms load time.

page.settings.loadImages = false;  

3. Analyze the page, then find and cancel unnecessary network requests.

Even in a normal browser like Google Chrome the site loads slowly: 129 requests/8.79s load time with AdblockPlus. There are a lot of requests (gif, 1Mb), many of them to third-party sites like facebook and twitter (to fetch widgets) and to ad sites.
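To see which requests are the slow ones from inside PhantomJS itself, one option is to time every resource with the `onResourceRequested` and `onResourceReceived` callbacks. The sketch below is only an illustration: `ResourceTimer` and its methods are hypothetical names, not part of PhantomJS, and the wiring at the bottom runs only when the script is executed by PhantomJS.

```javascript
// Hypothetical helper: records start and end times per resource id.
function ResourceTimer() {
    this.entries = {};
}
ResourceTimer.prototype.start = function (id, url, timeMs) {
    this.entries[id] = { url: url, start: timeMs, end: null };
};
ResourceTimer.prototype.end = function (id, timeMs) {
    if (this.entries[id]) {
        this.entries[id].end = timeMs;
    }
};
// Returns the n slowest finished requests as {url, ms}, slowest first.
// Requests that have not finished yet are skipped.
ResourceTimer.prototype.slowest = function (n) {
    var self = this;
    return Object.keys(this.entries)
        .map(function (id) { return self.entries[id]; })
        .filter(function (e) { return e.end !== null; })
        .map(function (e) { return { url: e.url, ms: e.end - e.start }; })
        .sort(function (a, b) { return b.ms - a.ms; })
        .slice(0, n);
};

// PhantomJS wiring; skipped when the file is loaded anywhere else.
if (typeof phantom !== 'undefined') {
    var timer = new ResourceTimer();
    var page = require('webpage').create();
    page.onResourceRequested = function (requestData) {
        timer.start(requestData.id, requestData.url, Date.now());
    };
    page.onResourceReceived = function (response) {
        // 'end' marks the last chunk of a response
        if (response.stage === 'end') {
            timer.end(response.id, Date.now());
        }
    };
    page.open('https://www.halooglasi.com/nekretnine/izdavanje-stanova/novi-beograd---novi-merkator-id19270/5425485514649', function () {
        timer.slowest(10).forEach(function (e) {
            console.log(e.ms + 'ms  ' + e.url);
        });
        phantom.exit();
    });
}
```

The hosts that dominate this list are the natural candidates for blocking. Note that a page object has only one `onResourceRequested` handler, so timing and blocking would have to share the same callback if both are wanted.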

Those third-party requests can be cancelled too:

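// Requests whose URL contains any of these strings are aborted
var block_urls = ['gstatic.com', 'adocean.pl', 'gemius.pl', 'twitter.com', 'facebook.net', 'facebook.com', 'planplus.rs'];

page.onResourceRequested = function(requestData, request){
    // Index loop instead of for...in, which is not meant for arrays and leaked a global `url`
    for(var i = 0; i < block_urls.length; i++) {
        if(requestData.url.indexOf(block_urls[i]) !== -1) {
            request.abort();
            console.log(requestData.url + " aborted");
            return;
        }
    }
};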
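Substring matching is simple, but it also aborts any URL that merely mentions a blocked domain, for example in a query string. If that ever becomes a problem, a stricter variant is to compare only the host part of the URL. This is a sketch under that assumption; `hostOf`, `isBlocked` and `blockedDomains` are hypothetical names, and the code sticks to ES5 so that it should also run in PhantomJS.

```javascript
// Extracts the lower-cased host from an absolute URL.
// Returns '' when there is no host (e.g. data: URIs).
function hostOf(url) {
    var m = /^[a-z][a-z0-9+.\-]*:\/\/(?:[^\/?#@]*@)?([^\/?#:]+)/i.exec(url);
    return m ? m[1].toLowerCase() : '';
}

// True if the URL's host is one of the domains or a subdomain of one.
function isBlocked(url, domains) {
    var host = hostOf(url);
    for (var i = 0; i < domains.length; i++) {
        var d = domains[i];
        if (host === d || host.slice(-(d.length + 1)) === '.' + d) {
            return true;
        }
    }
    return false;
}

// PhantomJS wiring; skipped when the file is loaded anywhere else.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var blockedDomains = ['gstatic.com', 'adocean.pl', 'gemius.pl', 'twitter.com', 'facebook.net', 'facebook.com', 'planplus.rs'];
    page.onResourceRequested = function (requestData, networkRequest) {
        if (isBlocked(requestData.url, blockedDomains)) {
            networkRequest.abort();
        }
    };
}
```

With this version `https://platform.twitter.com/widgets.js` is still aborted, while a first-party URL that only carries `twitter.com` as a parameter is let through.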

The load time for me is now just 4393ms, with the page loaded and usable: PhantomJS screenshot
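To put the three measurements side by side, the reduction relative to the question's baseline can be computed as below. The 14000ms baseline is an assumption taken from the "about 14 seconds" in the question, which was measured on a different machine, so the percentages are rough; `improvementPercent` is a hypothetical helper name.

```javascript
// Percentage of time saved relative to a baseline, rounded to one decimal.
function improvementPercent(baselineMs, measuredMs) {
    return Math.round((1 - measuredMs / baselineMs) * 1000) / 10;
}

var baselineMs = 14000; // assumption: "about 14 seconds" from the question

var steps = [
    { label: 'budget VPS, no optimization', ms: 9589 },
    { label: 'image loading off', ms: 8160 },
    { label: 'third-party requests aborted', ms: 4393 }
];

steps.forEach(function (s) {
    console.log(s.label + ': ' + s.ms + 'ms, ' +
        improvementPercent(baselineMs, s.ms) + '% less than the baseline');
});
```

Under that assumption the three steps come out at roughly 31.5%, 41.7% and 68.6% less time than the original run.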

I don't think much more can be done without tinkering with the page's code, because judging by the page source it is quite script-heavy.

The whole code:

var page = require('webpage').create();
var fs = require("fs");

// console.time polyfill from https://github.com/callmehiphop/console-time
;(function( console ) {
  var timers;
  if ( !console ) {
    return;
  }
  timers = {};
  console.time = function( name ) {
    if ( name ) {
      timers[ name ] = Date.now();
    }
  };
  console.timeEnd = function( name ) {
    if ( timers[ name ] ) {
      console.log( name + ': ' + (Date.now() - timers[ name ]) + 'ms' );
      delete timers[ name ];
    }
  };
}( window.console ));

console.time("open");

page.settings.loadImages = false;
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36';
page.viewportSize = {
  width: 1280,
  height: 800
};

// Requests whose URL contains any of these strings are aborted
var block_urls = ['gstatic.com', 'adocean.pl', 'gemius.pl', 'twitter.com', 'facebook.net', 'facebook.com', 'planplus.rs'];
page.onResourceRequested = function(requestData, request){
    for(var i = 0; i < block_urls.length; i++) {
        if(requestData.url.indexOf(block_urls[i]) !== -1) {
            request.abort();
            console.log(requestData.url + " aborted");
            return;
        }
    }
};

page.open('https://www.halooglasi.com/nekretnine/izdavanje-stanova/novi-beograd---novi-merkator-id19270/5425485514649', function () {
    // Save the HTML as it stands when the load callback fires
    fs.write("longload.html", page.content, 'w');

    console.timeEnd("open");

    // Wait 3 more seconds so late scripts can finish, then take a screenshot and quit
    setTimeout(function(){
        page.render('longload.png');
        phantom.exit();
    }, 3000);

});
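The fixed 3-second delay above is there for the screenshot. If only the script-generated data matters, one possible refinement is to poll until the element you need exists and exit right away. The sketch below is an illustration, not part of the original answer: `waitFor` is a hypothetical helper, and the `#price` selector is a made-up placeholder for whatever element holds the data.

```javascript
// Calls onDone(true) as soon as condition() returns true,
// or onDone(false) once timeoutMs has elapsed without success.
function waitFor(condition, onDone, timeoutMs, intervalMs) {
    var start = Date.now();
    if (condition()) {
        onDone(true);
        return;
    }
    var timer = setInterval(function () {
        if (condition()) {
            clearInterval(timer);
            onDone(true);
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onDone(false);
        }
    }, intervalMs || 100);
}

// PhantomJS wiring; skipped when the file is loaded anywhere else.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.settings.loadImages = false;
    page.open('https://www.halooglasi.com/nekretnine/izdavanje-stanova/novi-beograd---novi-merkator-id19270/5425485514649', function (status) {
        if (status !== 'success') {
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            return page.evaluate(function () {
                // '#price' is a hypothetical selector, replace with a real one
                return document.querySelector('#price') !== null;
            });
        }, function (found) {
            require('fs').write('longload.html', page.content, 'w');
            phantom.exit(found ? 0 : 1);
        }, 5000, 100);
    });
}
```

The saved `longload.html` can then be parsed with cheerio on the Node side, as the question intended.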
