Looping over URLs to do the same thing

Question

I am trying to scrape a few sites. Here is my code:

for (var i = 0; i < urls.length; i++) {
    url = urls[i];
    console.log("Start scraping: " + url);

    page.open(url, function () {
        waitFor(function() {
            return page.evaluate(function() {
                return document.getElementById("progressWrapper").childNodes.length == 1;
            });

        }, function() {
            var price = page.evaluate(function() {
                // do something
                return price;
            });

            console.log(price);
            result = url + " ; " + price;
            output = output + "\n" + result;
        });
    });

}
fs.write('test.txt', output);
phantom.exit();

I want to scrape all sites in the array urls, extract some information, and then write this information to a text file.

But there seems to be a problem with the for loop. When I scrape only one site without a loop, everything works as I want. With the loop, nothing happens at first, and then the line

console.log("Start scraping: " + url);

is shown, but one time too many. If urls = {a,b,c}, then PhantomJS prints:

Start scraping: a 
Start scraping: b 
Start scraping: c 
Start scraping:

It seems that page.open isn't called at all. I am a newbie to JS, so I am sorry for this stupid question.

Recommended answer

PhantomJS is asynchronous: page.open() returns immediately and runs its callback later, once the page has loaded. By calling page.open() several times in a loop, you overwrite the current request with a new one before it has finished, and that request is in turn overwritten. On top of that, fs.write() and phantom.exit() run right after the loop, before any callback has had a chance to fire. You need to open the pages one after the other, for example like this:

page.open(url, function () {
    waitFor(function() {
       // something
    }, function() {
        page.open(url, function () {
            waitFor(function() {
               // something
            }, function() {
                // and so on
            });
        });
    });
});
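
To see why the original loop misbehaves, here is a small simulation in plain JavaScript. It is only a sketch: fakeOpen and pending are hypothetical stand-ins for page.open() and the event loop, not PhantomJS APIs. It shows only the ordering problem (every callback runs after the loop has ended, when url already holds the last value), not the fact that a real page also drops the requests that get overwritten.

```javascript
// Hypothetical stand-in for page.open(): it only queues the callback, the way
// an asynchronous API defers it until the current script has finished.
var pending = [];
function fakeOpen(url, callback) {
    pending.push(callback);
}

var urls = ["a", "b", "c"];
var log = [];
var url;

for (var i = 0; i < urls.length; i++) {
    url = urls[i];
    log.push("Start scraping: " + url);
    fakeOpen(url, function () {
        // Runs only after the loop is over, so `url` is already "c" here.
        log.push("Loaded: " + url);
    });
}
log.push("Loop finished");

// Simulate the event loop delivering the deferred callbacks.
pending.forEach(function (callback) { callback(); });

console.log(log.join("\n"));
```

All three "Start scraping" lines and "Loop finished" are logged before any "Loaded" line, and every "Loaded" line reports the last URL.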

But nesting the callbacks by hand is tedious. Utilities such as async.js can help you write nicer code. You can install it through npm in the directory of the PhantomJS script.

var async = require("async"); // install async through npm
var tests = urls.map(function(url){
    return function(callback){
        page.open(url, function () {
            waitFor(function() {
               // something
            }, function() {
                callback();
            });
        });
    };
});
async.series(tests, function finish(){
    fs.write('test.txt', output);
    phantom.exit();
});

If you don't want any dependencies, it is also easy to define your own recursive function (from here):

var urls = [/*....*/];

function handle_page(url){
    page.open(url, function(){
        waitFor(function() {
           // something
        }, function() {
            next_page();
        });
    });
}

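function next_page(){
    var url = urls.shift();
    // shift() returns undefined once the array is empty
    // (testing !urls would never be true, since an array is always truthy)
    if(!url){
        phantom.exit(0);
        return; // phantom.exit() does not stop the script immediately
    }
    handle_page(url);
}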

next_page();
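
The recursive approach can also collect the results and hand them over only after the last page is done, which is what the question needs before writing test.txt. The following is a minimal sketch in plain JavaScript, not part of the original answer: processSequentially, worker, and finished are hypothetical names, and the worker here is a stand-in that replies immediately instead of calling page.open() and waitFor().

```javascript
// Hypothetical helper: runs worker(item, next) for one item at a time and
// calls done(results) after the last item has reported back.
function processSequentially(items, worker, done) {
    var queue = items.slice();   // copy so the caller's array is untouched
    var results = [];

    function next() {
        if (queue.length === 0) {
            done(results);
            return;
        }
        var item = queue.shift();
        worker(item, function (result) {
            results.push(item + " ; " + result);
            next();              // start the next item only now
        });
    }

    next();
}

// Usage with a stand-in worker. In a PhantomJS script the worker would call
// page.open(url, ...) and waitFor(...), then pass the scraped price to finished().
var output = "";
processSequentially(["a", "b", "c"], function (url, finished) {
    finished("price of " + url);
}, function (results) {
    output = results.join("\n");
    console.log(output);
    // In PhantomJS, this is where fs.write('test.txt', output) and
    // phantom.exit() would go.
});
```

Because the write and the exit happen inside the final callback, they cannot run before the last page has been processed.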
