How to set a time interval between page scrapes with PhantomJS

Question

I have written a PhantomJS script that scrapes multiple pages. It works, but I can't figure out how to set a time interval between scrapes. I tried using setInterval to pass in one item from arrayList about every 5 seconds, but that doesn't seem to work and the script keeps breaking. Here is a sample of my PhantomJS script:

Without setInterval:

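// Assumes the phantomjs-node bridge (npm package "phantom"), whose callback API this code matches.
var phantom = require('phantom');

var arrayList = ['string1', 'string2', 'string3' /* .... */];

arrayList.forEach(function(eachItem) {
    var webAddress = "http://www.example.com/" + eachItem;
    // A new PhantomJS process is created for every item, all at once.
    phantom.create(function(ph) {
        return ph.createPage(function(page) {
            return page.open(webAddress, function(status) {
                console.log("opened site? ", status);

                page.injectJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {
                    // Wait 5 seconds before scraping the page.
                    setTimeout(function() {
                        return page.evaluate(function() {

                            //code here for gathering data

                        }, function(result) {
                            // Exit before returning; the original order left ph.exit() unreachable.
                            ph.exit();
                            return result;
                        });
                    }, 5000);
                });
            });
        });
    });
});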

With setInterval:

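// Assumes the phantomjs-node bridge (npm package "phantom"), whose callback API this code matches.
var phantom = require('phantom');

var arrayList = ['string1', 'string2', 'string3' /* .... */];
var i = 0;
var scrapeInterval = setInterval(function() {
    var webAddress = "http://www.example.com/" + arrayList[i];
    phantom.create(function(ph) {
        return ph.createPage(function(page) {
            return page.open(webAddress, function(status) {
                console.log("opened site? ", status);

                page.injectJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {
                    // Wait 5 seconds before scraping the page.
                    setTimeout(function() {
                        return page.evaluate(function() {

                            //code here for gathering data

                        }, function(result) {
                            // Exit before returning; the original order left ph.exit() unreachable.
                            ph.exit();
                            return result;
                        });
                    }, 5000);
                });
            });
        });
    });
    i++;
    // Stop once every item has been dispatched (>= avoids one extra run with an undefined item).
    if (i >= arrayList.length) {
        clearInterval(scrapeInterval);
    }
}, 5000);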

Basically, I would like to send in a chunk of items (10-20 of them) from arrayList, wait 1-2 minutes, and then send in the next chunk, without overwhelming the website. Alternatively, is there a way to set a time interval so that the script loops through the array one item every 2-3 seconds?

Recommended answer

The problem is that PhantomJS is asynchronous, but loop iteration is not. In the first snippet, all iterations are executed before the first page has even loaded, so you are essentially spawning multiple PhantomJS processes that run at the same time.
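
To make the timing concrete, here is a minimal sketch that needs no PhantomJS at all. fakeOpen is a hypothetical stand-in for page.open (it only simulates an asynchronous load with setTimeout), and scrapeAllAtOnce is an illustrative name, not something from the original script.

```javascript
// Hypothetical stand-in for an asynchronous page load; not part of PhantomJS.
function fakeOpen(url, callback) {
    setTimeout(function () { callback('success'); }, 10);
}

// Mirrors the structure of the first snippet: forEach starts every load
// immediately and never waits for any of them to finish.
function scrapeAllAtOnce(items, log, done) {
    var pending = items.length;
    items.forEach(function (item) {
        log.push('start ' + item);
        fakeOpen('http://www.example.com/' + item, function () {
            log.push('done ' + item);
            pending -= 1;
            if (pending === 0) { done(log); }
        });
    });
    // Reached before any of the callbacks above has run.
    log.push('loop finished');
}

scrapeAllAtOnce(['string1', 'string2', 'string3'], [], function (log) {
    console.log(log.join('\n'));
});
```

Every "start" line and "loop finished" are logged before the first "done" line, which is the same ordering that makes the real script hit the site with all requests at once.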

You can use something like the async library to make the tasks run sequentially:

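// Assumes the phantomjs-node bridge (npm package "phantom") and the "async" package.
var phantom = require('phantom');
var async = require('async');

// One PhantomJS process and one page are reused for every URL.
phantom.create(function(ph) {
    ph.createPage(function(page) {
        var arrayList = ['string1', 'string2', 'string3' /* .... */];

        // Build one task per item; each task reports completion through callback.
        var tasks = arrayList.map(function(eachItem) {
            return function(callback) {
                var webAddress = "http://www.example.com/" + eachItem;
                page.open(webAddress, function(status) {
                    console.log("opened site? ", status);

                    // Note: PhantomJS documents injectJs for local files and includeJs for remote URLs;
                    // try includeJs here if jQuery does not load.
                    page.injectJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {

                        // Wait 5 seconds before scraping the page.
                        setTimeout(function() {
                            return page.evaluate(function() {
                                //code here for gathering data
                            }, function(result) {
                                callback(null, result);
                            });
                        }, 5000);
                    });
                });
            };
        });

        // async.series starts a task only after the previous one has called back.
        async.series(tasks, function(err, results) {
            if (err) { console.error(err); }
            console.log("Finished");
            ph.exit();
        });
    });
});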

Of course, you can also move phantom.create() inside each task, which creates a separate process for each request, but the code above will be faster.
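
The question also asks for a pause between items, or between chunks of 10-20 items. One possible way to get that is sketched below in plain JavaScript. chunk, eachWithDelay and fakeScrape are hypothetical helpers written for this illustration (they come from neither PhantomJS nor the async library), and the delays are shortened so the demo finishes quickly.

```javascript
// Split a list into consecutive chunks of at most `size` items.
function chunk(list, size) {
    var chunks = [];
    for (var i = 0; i < list.length; i += size) {
        chunks.push(list.slice(i, i + size));
    }
    return chunks;
}

// Call worker(item, next) for each item in order. After an item finishes,
// wait delayMs before starting the next one. Calls done(results) at the end.
function eachWithDelay(items, delayMs, worker, done) {
    var results = [];
    function step(index) {
        if (index >= items.length) { return done(results); }
        worker(items[index], function (result) {
            results.push(result);
            // No pause is needed after the last item.
            var wait = index + 1 < items.length ? delayMs : 0;
            setTimeout(function () { step(index + 1); }, wait);
        });
    }
    step(0);
}

// Hypothetical stand-in for "open the page and scrape it".
function fakeScrape(item, next) {
    setTimeout(function () { next('scraped ' + item); }, 5);
}

// Demo: chunks of 2 items with a 50 ms pause between chunks.
// For the scenario in the question this would be chunks of 10-20 items
// and a pause of 60000-120000 ms.
var groups = chunk(['string1', 'string2', 'string3', 'string4', 'string5'], 2);
eachWithDelay(groups, 50, function (group, nextGroup) {
    // Items inside a chunk run back to back (delay 0).
    eachWithDelay(group, 0, fakeScrape, nextGroup);
}, function (resultsPerGroup) {
    console.log(JSON.stringify(resultsPerGroup));
});
```

For the simpler "one item every 2-3 seconds" case, a single call such as eachWithDelay(arrayList, 3000, scrapeOne, finish) would do, where scrapeOne is a hypothetical function wrapping the page.open logic from the answer. With async.series, a similar effect could be had by wrapping callback(null, result) in a setTimeout inside each task.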
