Looping over URLs to do the same thing


Question

I am trying to scrape a few sites. Here is my code:

for (var i = 0; i < urls.length; i++) {
    url = urls[i];
    console.log("Start scraping: " + url);

    page.open(url, function () {
        waitFor(function() {
            return page.evaluate(function() {
                return document.getElementById("progressWrapper").childNodes.length == 1;
            });

        }, function() {
            var price = page.evaluate(function() {
                // do something
                return price;
            });

            console.log(price);
            result = url + " ; " + price;
            output = output + "\r\n" + result;
        });
    });

}
fs.write('test.txt', output);
phantom.exit();

I want to scrape all sites in the array urls, extract some information from each, and then write this information to a text file.

But there seems to be a problem with the for loop. When I scrape only one site without a loop, everything works as I want. With the loop, nothing happens at first, and then the line

console.log("Start scraping: " + url);

is printed, but one time too many. If urls = [a, b, c], then PhantomJS prints:

Start scraping: a 
Start scraping: b 
Start scraping: c 
Start scraping:

It seems that page.open isn't called at all. I am new to JS, so I apologize if this is a basic question.

Answer

PhantomJS is asynchronous: page.open() returns immediately and its callback runs later, once the page has loaded. By calling page.open() several times in a loop on the same page object, you start a new request before the current one has finished, so each request is overwritten by the next one. The fs.write() and phantom.exit() calls placed after the loop also run before any callback has fired. You need to execute the requests one after the other, for example like this:

page.open(url, function () {
    waitFor(function() {
       // something
    }, function() {
        page.open(url, function () {
            waitFor(function() {
               // something
            }, function() {
                // and so on
            });
        });
    });
});

But this is tedious. Utilities such as async.js can help you write nicer code. You can install it through npm in the directory of the PhantomJS script.

var async = require("async"); // install async through npm
var tests = urls.map(function(url){
    return function(callback){
        page.open(url, function () {
            waitFor(function() {
               // something
            }, function() {
                callback();
            });
        });
    };
});
async.series(tests, function finish(){
    fs.write('test.txt', output);
    phantom.exit();
});
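
To make it clearer what async.series does with these task functions, here is a minimal stand-in written for illustration. It is a sketch, not the async.js source: the series helper is hypothetical, and setTimeout stands in for page.open() so the snippet can run under Node.js without PhantomJS.

```javascript
// Hypothetical stand-in for async.series: runs each task only after the
// previous task has called its callback, then calls done once at the end.
function series(tasks, done) {
    var index = 0;

    function runNext(err) {
        if (err || index >= tasks.length) {
            return done(err || null);
        }
        var task = tasks[index++];
        task(runNext);
    }

    runNext();
}

// Simulated asynchronous page loads (assumption: setTimeout replaces page.open).
var urls = ["a", "b", "c"];
var visited = [];

var tasks = urls.map(function (url) {
    return function (callback) {
        setTimeout(function () {
            visited.push(url);   // the scraping work would happen here
            callback();          // signal that the next task may start
        }, 10);
    };
});

series(tasks, function finish() {
    // With PhantomJS, fs.write() and phantom.exit() would go here.
    console.log("visited in order: " + visited.join(", "));
});
```

The important point is that the final callback is the only place where it is safe to write the file and exit, because it runs only after every task has reported completion.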

If you don't want any dependencies, it is also easy to define your own recursive function:

var urls = [/*....*/];

function handle_page(url){
    page.open(url, function(){
        waitFor(function() {
           // something
        }, function() {
            next_page();
        });
    });
}

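function next_page(){
    var url = urls.shift();
    // urls itself is always truthy, so test the shifted value instead
    if(!url){
        phantom.exit(0);
        return; // phantom.exit() does not stop the script immediately
    }
    handle_page(url);
}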

next_page();
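
The recursive approach can also collect the results and hand them over once the queue is empty, which is what the question ultimately needs. The following is a minimal sketch under stated assumptions: fakePage, scrapeAll, and the "price-of-" placeholder are hypothetical names used for illustration, and fakePage imitates page.open() with setTimeout so the code runs under Node.js.

```javascript
// Stand-in for the PhantomJS page object (assumption, so this runs in Node.js).
// In a PhantomJS script you would pass require("webpage").create() instead.
var fakePage = {
    open: function (url, callback) {
        setTimeout(function () { callback("success"); }, 10);
    }
};

// Opens the URLs one at a time and calls done(output) after the last one.
function scrapeAll(page, urls, done) {
    var queue = urls.slice();   // copy, so the caller's array is not emptied
    var lines = [];

    function nextPage() {
        if (queue.length === 0) {
            return done(lines.join("\r\n"));
        }
        var url = queue.shift();
        console.log("Start scraping: " + url);
        page.open(url, function (status) {
            // Placeholder for the waitFor / page.evaluate step in the question.
            var price = "price-of-" + url;
            lines.push(url + " ; " + price);
            nextPage();         // start the next request only now
        });
    }

    nextPage();
}

scrapeAll(fakePage, ["a", "b", "c"], function (output) {
    // With PhantomJS: fs.write("test.txt", output); phantom.exit();
    console.log(output);
});
```

Because url is a local variable of nextPage, each callback sees the URL it was started with, unlike the shared url variable in the original loop.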
