How to go to the next page for scraping in PhantomJS


Question

I'm trying to get several elements from a website that has several pages. I'm currently using PhantomJS for this, and my code almost works. The issue is that it scrapes the first page twice, even though (according to the log) it seems to have already moved on to the second page.

Here is the code:

var page = require('webpage').create();
page.viewportSize = { width: 1061, height: 1000 }; //To specify the window size
page.open("website", function () {

    function fetch_names(){
        var name = page.evaluate(function () {
            return [].map.call(document.querySelectorAll('div.pepitesteasermain h2 a'), function(name){
                return name.getAttribute('href');
            });
        });
        console.log(name.join('\n'));
        page.render('1.png');
        window.setTimeout(function (){
            goto_next_page();
        }, 5000);
    }

    function goto_next_page(){
        page.evaluate(function () {
            var a = document.querySelector('#block-system-main .next a');
            var e = document.createEvent('MouseEvents');
            e.initMouseEvent('click', true, true, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null);
            a.dispatchEvent(e);
            waitforload = true;

        });
        fetch_names();
    }

    fetch_names();
});

You can try it yourself to see how all of this works.

Recommended answer

You need to wait for the page to load after you click, not before you click. Move the setTimeout() call from fetch_names to goto_next_page:

function fetch_names(){
    var name = page.evaluate(function () {
        return [].map.call(document.querySelectorAll('div.pepitesteasermain h2 a'), function(name){
            return name.getAttribute('href');
        });
    });
    console.log(name.join('\n'));
    page.render('1.png');
    goto_next_page();
}

function goto_next_page(){
    page.evaluate(function () {
        var a = document.querySelector('#block-system-main .next a');
        var e = document.createEvent('MouseEvents');
        e.initMouseEvent('click', true, true, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null);
        a.dispatchEvent(e);
        waitforload = true;

    });
    window.setTimeout(function (){
        fetch_names();
    }, 5000);
}

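Note that there are better ways to wait than a static timeout. Instead, you can:

  • set the page.onLoadFinished handler so that fetch_names runs each time a page finishes loading:

    page.onLoadFinished = fetch_names;

  • wait for a specific selector to appear, using the waitFor() function from the PhantomJS examples.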
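As an illustration of waiting on a condition rather than a fixed delay, here is a minimal polling sketch. It is not the waitFor() shipped with the PhantomJS examples. The helpers wait_until, first_href and click_next_and_wait are hypothetical names introduced only for this sketch, and the 10-second limit and 250 ms interval are arbitrary assumptions. Because the result selector already matches on the first page, the sketch waits for the first result link to change instead of waiting for the selector to appear.

```javascript
// Hypothetical helper: poll testFx every intervalMs. Call onReady once it
// returns true, or onTimeout if timeoutMs elapses first.
function wait_until(testFx, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (testFx()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}

// Hypothetical helper: href of the first result link, or null if none exists.
function first_href(page) {
    return page.evaluate(function () {
        var a = document.querySelector('div.pepitesteasermain h2 a');
        return a ? a.getAttribute('href') : null;
    });
}

// Hypothetical helper: click "next", then call onLoaded once the first result
// link differs from the one seen before the click. Calls onGiveUp if there is
// no "next" link or if nothing changes within the (assumed) 10 seconds.
function click_next_and_wait(page, onLoaded, onGiveUp) {
    var before = first_href(page);
    var clicked = page.evaluate(function () {
        var a = document.querySelector('#block-system-main .next a');
        if (!a) { return false; } // probably the last page
        var e = document.createEvent('MouseEvents');
        e.initMouseEvent('click', true, true, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null);
        a.dispatchEvent(e);
        return true;
    });
    if (!clicked) {
        onGiveUp();
        return;
    }
    wait_until(function () {
        return first_href(page) !== before;
    }, onLoaded, onGiveUp, 10000, 250);
}
```

In the script above, goto_next_page could then be reduced to something like click_next_and_wait(page, fetch_names, function () { phantom.exit(); }), which also gives the loop a way to stop on the last page. This assumes that consecutive pages never start with the same link.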
