Casperjs iterating over a list of links using casper.each

Question

I am trying to use Casperjs to get a list of links from a page, then open each of those links and add a particular type of data from those pages to an array object.

The problem I am having is with the loop that executes over each of the list items.

First I get a listOfLinks from the original page. This part works, and checking its length confirms that the list is populated.

However, when I use the loop statement this.each as shown below, none of the console statements ever show up, and casperjs appears to skip over this block.

When I replace this.each with a standard for loop, execution only gets partway through the first link: the statement "Creating new array in object for x.html" appears once and then the code stops executing. Using an IIFE doesn't change this.

Edit: in verbose debugging mode, the following happens:

Creating new array object for https://example.com 
[debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=true, isMainFrame=true

So for some reason the URL that is passed into the thenOpen function gets changed to about:blank...

I feel like there is something about Casperjs's asynchronous nature that I am not grasping here, and I would be grateful to be pointed towards a working example.

casper.then(function () {

  var date = Date.now();
  console.log(date);

  var object = {};
  object[date] = {}; // new object for date

  var listOfLinks = this.evaluate(function(){
    console.log("getting links");
    return document.getElementsByClassName('importantLink');
  });

  console.log(listOfLinks.length);

  this.each(listOfLinks, function(self, link) {

    var eachPageHref = link.href;

    console.log("Creating new array in object for " + eachPageHref);

    object[date][eachPageHref] = []; // array for page to store names

    self.thenOpen(eachPageHref, function () {

      var listOfItems = this.evaluate(function() {
        var items = [];
        // Perform DOM manipulation to get items
        return items;
      });
    });

    object[date][eachPageHref] = items;

  });
  console.log(JSON.stringify(object));

});


Recommended answer

I decided to use our own Stackoverflow.com as a demo site to run your script against. I corrected a few minor things in your code, and the result is this exercise in getting comments from PhantomJS bounty questions.

var casper = require('casper').create();

casper
.start()
.open('http://stackoverflow.com/questions/tagged/phantomjs?sort=featured&pageSize=30')
.then(function () {

    var date = Date.now(), object = {};
    object[date] = {};

    var listOfLinks = this.evaluate(function(){

        // Getting links to other pages to scrape, this will be 
        // a primitive array that will be easily returned from page.evaluate
        var links = [].map.call(document.querySelectorAll("#questions .question-hyperlink"), function(link) {
          return link.href;
        });    
        return links;
    });

    // Now to iterate over that array of links
    this.each(listOfLinks, function(self, eachPageHref) {

        object[date][eachPageHref] = []; // array for page to store names

        self.thenOpen(eachPageHref, function () {

            // Getting comments from each page, also as an array
            var listOfItems = this.evaluate(function() {
                var items = [].map.call(document.getElementsByClassName("comment-text"), function(comment) {
                    return comment.innerText;
                });    
                return items;
            });
            object[date][eachPageHref] = listOfItems;
        });
    });

    // After each link has been scraped, output the resulting object
    this.then(function(){
        console.log(JSON.stringify(object));
    });
})

casper.run();
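
A note on the final this.then() in the script: step methods such as then() and thenOpen() only schedule their callbacks, so code written directly after the this.each() block would run before any page has been opened. The sketch below is a plain JavaScript stand-in, not CasperJS's actual implementation; createStepQueue, links, results and log are hypothetical names made up for this illustration. It only shows the ordering effect, under the assumption that queued steps run after the current step has finished.

```javascript
// Hypothetical miniature step queue (an illustration, not CasperJS internals).
function createStepQueue() {
  var steps = [];
  return {
    // Schedule a step; nothing is executed yet.
    then: function (step) {
      steps.push(step);
      return this;
    },
    // Run steps in order, including steps that were added while running.
    run: function () {
      for (var i = 0; i < steps.length; i++) {
        steps[i].call(this);
      }
    }
  };
}

var log = [];
var results = {};
var links = ['a.html', 'b.html']; // made-up link names

createStepQueue()
  .then(function () {
    var self = this;
    links.forEach(function (link) {
      results[link] = [];
      // Stand-in for thenOpen(): the callback is queued, not run here.
      self.then(function () {
        results[link] = ['item from ' + link];
        log.push('visited ' + link);
      });
    });
    // Runs immediately, before any queued callback has filled the arrays.
    log.push('immediate: ' + JSON.stringify(results));
    // Queued after the per-link steps, so it sees the collected data.
    self.then(function () {
      log.push('deferred: ' + JSON.stringify(results));
    });
  })
  .run();

console.log(log.join('\n'));
```

In this sketch the "immediate" entry still contains empty arrays, while the "deferred" entry contains the collected items, which is why the answer wraps the final console.log in this.then().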

What changed: page.evaluate now returns simple arrays, which casper.each() needs in order to iterate correctly. The href attributes are extracted right away inside page.evaluate. Also this correction:

 object[date][eachPageHref] = listOfItems; // previously assigned items which were undefined in this scope
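
Why simple arrays matter: the rule of thumb for evaluate() is that a return value is safe if it can be serialized as JSON, which is not the case for DOM nodes or functions. The sketch below approximates that boundary with a JSON round trip in plain JavaScript. crossBoundary is a hypothetical helper written for this illustration and is not part of CasperJS or PhantomJS, and the exact value you get back for a DOM collection in a real run may differ from what this approximation suggests.

```javascript
// Hypothetical helper: approximates the page-to-script boundary
// with a JSON round trip (an assumption made for illustration).
function crossBoundary(value) {
  return JSON.parse(JSON.stringify(value));
}

// An array of plain strings, like the hrefs collected in the answer.
var hrefs = ['https://example.com/a.html', 'https://example.com/b.html'];
var copiedHrefs = crossBoundary(hrefs);

// An object carrying something JSON cannot represent (a function),
// standing in for a richer object such as a DOM element.
var richObject = {
  href: 'https://example.com/a.html',
  open: function () {}
};
var copiedObject = crossBoundary(richObject);

console.log(copiedHrefs);               // the strings survive unchanged
console.log(Object.keys(copiedObject)); // the function member is gone
```

Extracting href inside evaluate() and returning only strings, as the corrected script does, keeps everything that crosses the boundary in a form that survives the trip.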

The result of running the script is:

{"1478596579898":{"http://stackoverflow.com/questions/40410927/phantomjs-from-node-on-windows":["en.wikipedia.org/wiki/File_URI_scheme – Igor 2 days ago\n","@Igor is there something in particular you see wrong, or are you suggesting the phantom module has an incorrect URI? – Danny Buonocore 2 days ago\n","Probably windows security issue not allowing to run an unsigned program. – Vaviloff yesterday\n"],"http://stackoverflow.com/questions/40412726/casperjs-iterating-over-a-list-of-links-using-casper-each":["Thanks, this looked really promising. I made the changes but it didn't solve the problem. And I just realised that in debug mode the following happens: Creating new array object for https://example.com [debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=true, isMainFrame=true and then Casperjs silently fails. It seems that the correct link that gets passed into thenOpen gets changed to about:blank... – cyc665 yesterday\n"]}}
