PhantomJS page.content isn't retrieving the page content

Question

I use PhantomJS to scrape websites that load dynamic content with JavaScript and Ajax.
I have the following code:

var page = require('webpage').create();
page.onError = function(msg, trace) {
    var msgStack = ['ERROR: ' + msg];
    if (trace && trace.length) {
        msgStack.push('TRACE:');
        trace.forEach(function(t) {
            msgStack.push(' -> ' + t.file + ': ' + t.line + (t.function ? ' (in function "' + t.function +'")' : ''));
        });
    }
    console.error(msgStack.join('\n'));
};
page.onConsoleMessage = function(msg, lineNum, sourceId) {
    console.log('CONSOLE: ' + msg + ' (from line #' + lineNum + ' in "' + sourceId + '")');
};
page.open('http://www.betexplorer.com/soccer/germany/oberliga-bayern-sud/wolfratshausen-unterhaching-ii/x8rBMAB8/', function () {
    console.log(page.content);
    phantom.exit();
});

The problem is that this code doesn't retrieve the source code I want.
If you open the URL in a web browser (such as Chrome) and inspect the page's dynamic source (the DOM after the JavaScript and Ajax calls have run), you will see that it is completely different from what PhantomJS returns.
In this case I need the source as the browser sees it.
This PhantomJS code usually retrieves the source code I need, but for this URL (and many others) it does not.
I assume PhantomJS doesn't know how to handle the JavaScript and Ajax calls that load dynamic content into this page.
I get these errors when I run the code:

ERROR: TypeError: 'undefined' is not a function (evaluating 'function(e){
        this.pointer.x = e.pageX;
        this.pointer.y = e.pageY;
    }.bind(this)')
TRACE:
 -> http://www.betexplorer.com/gres/tooltip.js?serial=1410131213: 207
 -> http://www.betexplorer.com/gres/tooltip.js?serial=1410131213: 157
 -> http://www.betexplorer.com/gres/tooltip.js?serial=1410131213: 310 (in function "tooltip")
 -> http://www.betexplorer.com/soccer/germany/oberliga-bayern-sud/wolfratshausen-unterhaching-ii/x8rBMAB8/: 291
 -> http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js: 2
 -> http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js: 2
 -> http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js: 2
 -> http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js: 2
CONSOLE: Invalid App Id: Must be a number or numeric string representing the application id. (from line #undefined in "undefined")
CONSOLE: FB.getLoginStatus() called before calling FB.init(). (from line #undefined in "undefined") 

So how do I get the dynamic source code of this page (http://www.betexplorer.com/soccer/germany/oberliga-bayern-sud/wolfratshausen-unterhaching-ii/x8rBMAB8/) using PhantomJS?

Recommended answer

Since the page is dynamically generated, you need to wait a little before you can access the intended page source.

page.open('http://www.betexplorer.com/soccer/germany/oberliga-bayern-sud/wolfratshausen-unterhaching-ii/x8rBMAB8/', function () {
    setTimeout(function(){
        console.log(page.content);
        phantom.exit();
    }, 5000); // 5 sec should be enough
});
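
A fixed delay is a guess: it can be too short on a slow connection and wasteful on a fast one. One common alternative is to poll until an element that only exists after the Ajax calls finish shows up. The following is a minimal sketch of that idea, not part of the original answer. The `waitFor` helper, the 10-second timeout, the 250 ms interval, and the `#dynamic-content` selector are all assumptions for illustration; you would need to substitute a selector that actually appears on the target page.

```javascript
// Poll `check` every `intervalMs` until it returns true or `timeoutMs` elapses.
// (Hypothetical helper written for this example, not a PhantomJS built-in.)
function waitFor(check, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (check()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}

// This part only runs under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http://www.betexplorer.com/soccer/germany/oberliga-bayern-sud/wolfratshausen-unterhaching-ii/x8rBMAB8/';
    var selector = '#dynamic-content'; // hypothetical: replace with a real element

    page.open(url, function (status) {
        if (status !== 'success') {
            console.error('Failed to load page');
            phantom.exit(1);
            return;
        }
        waitFor(
            function () {
                // evaluate() runs inside the page; the selector is passed in as an argument
                return page.evaluate(function (sel) {
                    return document.querySelector(sel) !== null;
                }, selector);
            },
            function () { console.log(page.content); phantom.exit(); },
            function () { console.error('Timed out waiting for ' + selector); phantom.exit(1); },
            10000, // assumed upper bound in milliseconds
            250    // assumed polling interval in milliseconds
        );
    });
}
```

The polling version exits as soon as the content is present and reports a timeout instead of silently printing an incomplete page.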

The TypeError: 'undefined' is not a function refers to bind, because PhantomJS 1.x doesn't support Function.prototype.bind. PhantomJS 1.x uses an old fork of QtWebKit which is comparable to Chrome 13 or Safari 5. The newer PhantomJS 2 uses a newer engine which supports bind. If you still use version 1.x, you need to add a shim inside the page.onInitialized event handler:

page.onInitialized = function(){
    page.evaluate(function(){
        var isFunction = function(o) {
          return typeof o == 'function';
        };

        var bind,
          slice = [].slice,
          proto = Function.prototype,
          featureMap;

        featureMap = {
          'function-bind': 'bind'
        };

        function has(feature) {
          var prop = featureMap[feature];
          return isFunction(proto[prop]);
        }

        // check for missing features
        if (!has('function-bind')) {
          // adapted from Mozilla Developer Network example at
          // https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Function/bind
          bind = function bind(obj) {
            var args = slice.call(arguments, 1),
              self = this,
              nop = function() {
              },
              bound = function() {
                return self.apply(this instanceof nop ? this : (obj || {}), args.concat(slice.call(arguments)));
              };
            nop.prototype = this.prototype || {}; // Firefox cries sometimes if prototype is undefined
            bound.prototype = new nop();
            return bound;
          };
          proto.bind = bind;
        }
    });
};

Taken from my answer here.
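
To make it clearer what the shim above provides, here is a simplified, standalone illustration of what bind does for a handler shaped like the one in the error message. It is only a sketch: `simpleBind` and the `tooltip` object are hypothetical names invented for this example, and unlike the full shim it does not handle bound functions being called with `new`.

```javascript
// Simplified stand-in for Function.prototype.bind (illustration only;
// it skips the `new` handling that the full shim implements).
function simpleBind(fn, thisArg) {
    var preset = Array.prototype.slice.call(arguments, 2);
    return function () {
        var later = Array.prototype.slice.call(arguments);
        return fn.apply(thisArg, preset.concat(later));
    };
}

// Hypothetical object shaped like the tooltip code from the error message.
var tooltip = { pointer: { x: 0, y: 0 } };

// Without binding, `this` inside the handler would not be `tooltip`.
var onMove = simpleBind(function (e) {
    this.pointer.x = e.pageX;
    this.pointer.y = e.pageY;
}, tooltip);

onMove({ pageX: 12, pageY: 34 });
console.log(tooltip.pointer.x, tooltip.pointer.y); // prints "12 34"
```

When bind is missing, as in PhantomJS 1.x, the page's script throws before it can register handlers like this one, which is why the shim has to be installed in page.onInitialized, before any page script runs.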
