Retrieved anchors list gets corrupted?


Problem description

I am trying to analyze anchor links (their text property) in PhantomJS.

The retrieval happens here:

var list = page.evaluate(function() {
  return document.getElementsByTagName('a');
});

This returns an object whose length property is correct (the same length I get when running document.getElementsByTagName('a'); in the console). But the vast majority of the elements in the object are null, which is not good. I have no idea why this is happening.

I have tried converting the result to a real array with slice, which did not help. I have tried different sites, with no difference. I have dumped a .png render to verify loading, and the site is properly loaded.

This is obviously not the full script, but a minimal script that shows the problem on a well-known public site ;)

How can I retrieve the full list of anchors from the loaded page?

var page = require('webpage').create();

page.onError = function(msg, trace) 
{ //Error handling mantra
  var msgStack = ['PAGE ERROR: ' + msg];
  if (trace && trace.length) {
    msgStack.push('TRACE:');
    trace.forEach(function(t) {
      msgStack.push(' -> ' + t.file + ': ' + t.line + (t.function ? ' (in function "' + t.function +'")' : ''));
    });
  }
  console.error(msgStack.join('\n'));
};

phantom.onError = function(msg, trace) 
{ //Error handling mantra
  var msgStack = ['PHANTOM ERROR: ' + msg];
  if (trace && trace.length) {
    msgStack.push('TRACE:');
    trace.forEach(function(t) {
      msgStack.push(' -> ' + (t.file || t.sourceURL) + ': ' + t.line + (t.function ? ' (in function ' + t.function +')' : ''));
    });
  }
  console.error(msgStack.join('\n'));
  phantom.exit(1);
};

function start( url )
{
  page.open( url , function (status)
  {
    console.log( 'Loaded' ,  url , ': ' , status  );
    if( status !== 'success' )
    {
      //phantom.exit() does not stop the callback immediately, so return as well
      phantom.exit( 1 );
      return;
    }

    page.render( 'login.png');

    var list = page.evaluate(function() {
      return  document.getElementsByTagName('a');
    });

    console.log( 'List length: ' , list.length );

    for(  var i = 0 ; i < list.length ; i++ )
    {
     if( !list[i] )
      {
        console.log( i , typeof list[i] ,  list[i] === null , list[i] === undefined );
        //list[i] === null -> true for the problematic anchors
        continue;
      }
      console.log( i,  list[i].innerText , ',' , list[i].text /*, JSON.stringify( list[i] ) */ );
    }
    //Exit with grace
    phantom.exit( 0 );
  });
}    

start( 'http://data.stackexchange.com/' );
//start( 'http://data.stackexchange.com/account/login?returnurl=/' );

Recommended answer

The current version of PhantomJS permits only primitive types (boolean, string, number, [] and {}) to pass to and from the page context. So essentially everything that is not plain data is stripped, and that includes DOM elements. t.niese found the quote from the docs:

Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

Closures, functions, DOM nodes, etc. will not work!
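
The JSON rule of thumb can be tried outside PhantomJS. The sketch below is plain JavaScript and uses a hypothetical helper, roundTrip, that stands in for the page-context boundary with a JSON.stringify / JSON.parse round trip. That stand-in is an assumption: PhantomJS's own serialization may differ in detail, so treat this as an illustration of the rule rather than a description of its internals.

```javascript
// Hypothetical helper: approximates crossing the page-context boundary
// with a JSON round trip. PhantomJS's own serializer may differ in detail.
function roundTrip(value) {
  var json = JSON.stringify(value);
  // JSON.stringify returns undefined for a top-level value it cannot represent
  return json === undefined ? undefined : JSON.parse(json);
}

// Plain data (strings, numbers, nested objects) survives unchanged.
var plain = roundTrip({ text: 'Home', href: '/index', count: 3 });

// A function-valued property is dropped from the object.
var withFunction = roundTrip({ text: 'Home', onClick: function () {} });

// A function inside an array is replaced by null.
var arrayWithFunction = roundTrip([1, function () {}, 'x']);

console.log(JSON.stringify(plain));             // {"text":"Home","href":"/index","count":3}
console.log(JSON.stringify(withFunction));      // {"text":"Home"}
console.log(JSON.stringify(arrayWithFunction)); // [1,null,"x"]
```

Anything that comes back from roundTrip intact is a reasonable candidate for an evaluate return value; anything that comes back missing or null should be reduced to strings or numbers inside the page context first.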

You need to do part of the work inside the page context. If you want the innerText property of every node, then you need to map it to a primitive type first:

var list = page.evaluate(function() {
    return Array.prototype.map.call(document.getElementsByTagName('a'), function(a){
        return a.innerText;
    });
});
console.log(list[0]); // innerText

You can of course map multiple properties at the same time:

return Array.prototype.map.call(document.getElementsByTagName('a'), function(a){
    return { text: a.innerText, href: a.href };
});
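
Putting the mapping into a named function makes it easy to try outside PhantomJS. The following is a minimal sketch, not the answerer's code: collectAnchors and fakeDocument are hypothetical names introduced here, and fakeDocument is only a stand-in so the example runs without a real page.

```javascript
// Hypothetical helper. It uses no closures, so it could be handed to
// page.evaluate as-is; `doc` falls back to the page's global `document`.
function collectAnchors(doc) {
  doc = doc || document;
  return Array.prototype.map.call(doc.getElementsByTagName('a'), function (a) {
    // Only strings leave this function, so the result is JSON-serializable.
    return { text: String(a.innerText || ''), href: String(a.href || '') };
  });
}

// Stand-in for a real document so the sketch runs outside PhantomJS.
// It returns an array-like object, as getElementsByTagName does.
var fakeDocument = {
  getElementsByTagName: function (tag) {
    if (tag !== 'a') {
      return { length: 0 };
    }
    return {
      length: 2,
      0: { innerText: 'Home', href: 'http://example.com/' },
      1: { innerText: 'Login', href: 'http://example.com/login' }
    };
  }
};

var anchors = collectAnchors(fakeDocument);
anchors.forEach(function (a, i) {
  console.log(i, a.text, a.href);
});
```

In a PhantomJS script the same function would presumably be used as var anchors = page.evaluate(collectAnchors); with no argument, so that it reads the loaded page's document. Every entry is then a plain object, and the null check from the question's loop should no longer be needed.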
