Scraping an infinite scroll page stops without scrolling

Question

I am currently working with PhantomJS and CasperJS to scrape links from a website. The site uses JavaScript to load results dynamically. The snippet below, however, does not get me all the results the page contains. What I need is to scroll down to the bottom of the page, check whether the spinner shows up (meaning there is more content still to come), wait until the new content has loaded, and then keep scrolling until no more new content appears. Then store the links with class name .title in an array. Link to the webpage for scraping.

var casper = require('casper').create();
var urls = [];
function tryAndScroll(casper) {
  casper.waitFor(function() {
    this.page.scrollPosition = { top: this.page.scrollPosition["top"] + 4000, left: 0 };
    return true;
  }, function() {
    var info = this.getElementInfo('.badge-post-grid-load-more');
    if (info["visible"] == true) {
      this.waitWhileVisible('.badge-post-grid-load-more', function () {
        this.emit('results.loaded');
      }, function () {
        this.echo('next results not loaded');
      }, 5000);
    }
  }, function() {
    this.echo("Scrolling failed. Sorry.").exit();
  }, 500);
}

casper.on('results.loaded', function () {
  tryAndScroll(this);
});

casper.start('http://example.com/', function() {
    this.waitUntilVisible('.title', function() {
        tryAndScroll(this);
      });
});

casper.then(function() {
  casper.each(this.getElementsInfo('.title'), function(casper, element, j) {
    var url = element["attributes"]["href"];
    urls.push(url);
  });
});

casper.run(function() {
    this.echo(urls.length + ' links found:');
    this.echo(urls.join('\n')).exit();
});


Recommended answer

I've looked at the page. Your misconception is probably that the .badge-post-grid-load-more element vanishes as soon as the next elements are loaded. This is not the case: it doesn't change at all. You have to find another way to test whether new elements were put into the DOM.

You could, for example, retrieve the current number of elements and use waitFor to detect when that number changes.

// Count the items currently rendered in the list.
function getNumberOfItems(casper) {
    return casper.getElementsInfo(".listview .badge-grid-item").length;
}

function tryAndScroll(casper) {
  // Scroll down to trigger loading of the next batch.
  casper.page.scrollPosition = { top: casper.page.scrollPosition["top"] + 4000, left: 0 };
  var info = casper.getElementInfo('.badge-post-grid-load-more');
  if (info.visible) {
    var curItems = getNumberOfItems(casper);
    // Wait until the item count changes, then scroll again.
    casper.waitFor(function check(){
      return curItems != getNumberOfItems(casper);
    }, function then(){
      tryAndScroll(this);
    }, function onTimeout(){
      this.echo("Timeout reached");
    }, 20000);
  } else {
    casper.echo("no more items");
  }
}
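
Since the load-more element reportedly never changes, one possible variant drops the visibility check and relies on the item count alone, treating a timeout as "the list is complete". The following is a minimal sketch under that assumption. The names `scrollUntilNoNewItems`, `onDone`, `step` and `timeout` are illustrative and not part of CasperJS or of the code above. Only `casper.evaluate`, `casper.waitFor` and `page.scrollPosition` are real APIs.

```javascript
// Sketch only: scrollUntilNoNewItems, onDone, step and timeout are
// illustrative names, not part of CasperJS or of the original answer.
function scrollUntilNoNewItems(casper, itemSelector, onDone, step, timeout) {
  step = step || 4000;        // pixels to scroll per round (assumed value)
  timeout = timeout || 20000; // ms to wait for new items before giving up

  // Count matching elements inside the page context.
  function countItems() {
    return casper.evaluate(function (selector) {
      return document.querySelectorAll(selector).length;
    }, itemSelector);
  }

  function round() {
    var before = countItems(); // count taken before scrolling
    var pos = casper.page.scrollPosition;
    casper.page.scrollPosition = { top: pos.top + step, left: 0 };

    casper.waitFor(function check() {
      return countItems() > before;
    }, function then() {
      round(); // new items arrived, so try another scroll
    }, function onTimeout() {
      // No growth within the timeout: treat the list as complete.
      onDone.call(casper, countItems());
    }, timeout);
  }

  round();
}
```

It could be called from inside the waitUntilVisible callback, for example as `scrollUntilNoNewItems(this, '.listview .badge-grid-item', function (n) { this.echo(n + ' items loaded'); });`. Because an onTimeout handler is supplied, the script continues with the following steps after the final timeout, so the step that collects the .title links still runs.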

I've also streamlined tryAndScroll a little. There were completely unnecessary functions: the first casper.waitFor wasn't waiting at all, because its check function returned true immediately, and because of that the onTimeout callback could never be invoked.
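
To illustrate why a check that always returns true makes the timeout branch unreachable, here is a simplified synchronous model of a polling wait. It is not CasperJS's actual implementation. `pollingWait` and `maxPolls` are hypothetical names used only for this sketch.

```javascript
// Simplified model of a polling wait; NOT CasperJS's real implementation.
// pollingWait and maxPolls are hypothetical names used for illustration.
function pollingWait(check, then, onTimeout, maxPolls) {
  for (var i = 1; i <= maxPolls; i++) {
    if (check()) {
      return then(i); // condition met: continue, reporting the poll count
    }
  }
  return onTimeout(); // condition never met within the allowed polls
}

// A check that always returns true (like the question's first waitFor)
// succeeds on the very first poll, so onTimeout is never reached.
var outcome = pollingWait(
  function () { return true; },
  function (polls) { return 'then after ' + polls + ' poll(s)'; },
  function () { return 'timeout'; },
  5
);
console.log(outcome); // prints "then after 1 poll(s)"
```

A wait only does useful work when its check can actually be false for a while, as with the item-count comparison in the answer's code.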
