Scraping multiple URLs by looping in PhantomJS


Question


I am using PhantomJS to scrape some websites and then extract the information with R. I am following this tutorial. Everything works fine for a single page, but I couldn't find a simple tutorial on how to automate this for multiple pages. My experiments so far:

var countries = [ "Albania" ,"Afghanistan"];
var len = countries.length;
var name1 = ".html";
var add1 = "http://www.kluwerarbitration.com/CommonUI/BITs.aspx?country=";
var country ="";
var name ="";
var add="";


for (i=1; i <= len; i++){

    country = countries[i]
    name = country.concat(name1)
    add = add1.concat(name1)

    var webPage = require('webpage');
    var page = webPage.create();

    var fs = require('fs');
    var path = name

    page.open(add, function (status) {
        var content = page.content;
        fs.write(path,content,'w')
        phantom.exit();
    });

}

I don't get any errors when running the code, but the script creates an HTML file only for the second country, and that file contains everything on the page except the small table I am interested in.

I tried to gather some information from similar questions. However, partly because I couldn't find a simple reproducible example, I don't understand what I am doing wrong.

Solution

The main problem seems to be that you're exiting too early. You're creating multiple page instances in a loop. Since PhantomJS is asynchronous, the call to page.open() returns immediately and the next for loop iteration is executed.

A for loop is fast, but web requests are slow. This means that your loop has fully executed before even the first page has loaded. It also means that the first page to finish loading exits PhantomJS, because you're calling phantom.exit() in each of those page.open() callbacks. Only the second country is written most likely because the loop starts at i = 1: arrays are zero-indexed, so countries[0] is never requested and countries[len] is undefined.
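The ordering described above can be reproduced in plain JavaScript, without PhantomJS. The following is only a minimal sketch: fakeOpen and runPending are hypothetical stand-ins for page.open() and the event loop, the exited flag stands in for phantom.exit(), and the example.com URLs are placeholders.

```javascript
// Hypothetical stand-ins: `pending` plays the role of the event loop's queue,
// and fakeOpen() mimics page.open() by only *scheduling* its callback.
var pending = [];
var log = [];

function fakeOpen(url, callback) {
    log.push('requested ' + url);
    pending.push(function () { callback('success'); });
}

// Run the queued callbacks, as the event loop would once the script is idle.
function runPending() {
    while (pending.length > 0) {
        pending.shift()();
    }
}

var urls = ['http://example.com/a', 'http://example.com/b'];
var exited = false;

for (var i = 0; i < urls.length; i++) {
    fakeOpen(urls[i], function (status) {
        if (exited) { return; }        // after the "exit", nothing else runs
        log.push('callback ran, exiting');
        exited = true;                  // stands in for phantom.exit()
    });
}
log.push('loop finished');

runPending();
console.log(log.join('\n'));
```

Both requests are logged before "loop finished", and only one callback gets to do its work before the simulated exit, which mirrors why only one file is written.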

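One way to fix this is to count the finished callbacks and exit only after the last one:

var webPage = require('webpage');
var fs = require('fs');

var countFinished = 0,
    maxFinished = len;

// Called once per page.open() callback; exits after the last page is done.
function checkFinish(){
    countFinished++;
    if (countFinished === maxFinished) {
        phantom.exit();
    }
}

// Arrays are zero-indexed, so iterate from 0 to len - 1.
for (var i = 0; i < len; i++) {
    country = countries[i];
    name = country.concat(name1);
    add = add1.concat(country);   // the country, not name1, completes the URL

    var page = webPage.create();
    var path = name;

    // Caution: `page` and `path` are shared by all iterations, so every
    // callback sees the values from the last one (fixed in the next version).
    page.open(add, function (status) {
        var content = page.content;
        fs.write(path, content, 'w');
        checkFinish();
    });
}

Two problems remain: every callback refers to the last page and path, and you're creating a lot of page instances without cleaning up. Wrap the loop body in an IIFE and close each page when you're done with it:

for (var i = 0; i < len; i++) {
    (function(i){
        var country = countries[i];
        var name = country.concat(name1);
        var add = add1.concat(country);

        var page = webPage.create();
        var path = name;

        page.open(add, function (status) {
            var content = page.content;
            fs.write(path, content, 'w');
            page.close();   // release this page instance
            checkFinish();
        });
    })(i);
}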

Since variables declared with var in JavaScript have function-level scope, you need an IIFE to retain a reference to the correct page instance in each page.open() callback. See this question for more information: Q: JavaScript closure inside loops – simple practical example
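The scoping issue can be seen in isolation with a few lines of plain JavaScript. This is a minimal sketch: the arrays shared and captured are illustrative names, and the callbacks are invoked by hand after the loop instead of by page.open().

```javascript
var names = ['Albania', 'Afghanistan'];
var shared = [];    // callbacks that read a function-scoped variable
var captured = [];  // callbacks created inside an IIFE

for (var i = 0; i < names.length; i++) {
    var path = names[i] + '.html';   // one `path` variable for the whole loop
    shared.push(function () { return path; });

    (function (i) {
        var ownPath = names[i] + '.html';   // a fresh variable per iteration
        captured.push(function () { return ownPath; });
    })(i);
}

// The callbacks run only after the loop has finished, like page.open() callbacks.
var sharedResults = shared.map(function (f) { return f(); });
var capturedResults = captured.map(function (f) { return f(); });

console.log(sharedResults);    // both entries are 'Afghanistan.html'
console.log(capturedResults);  // 'Albania.html', then 'Afghanistan.html'
```

Without the IIFE, both callbacks would write to the same file name; with it, each callback keeps its own value.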

If you don't want to clean up afterwards, then you should use the same page instance over all of those URLs. I already have an answer about doing that here: A: Looping over urls to do the same thing
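Reusing one page instance means opening the URLs one after another, starting each request only from the previous callback. Here is a minimal sketch of that pattern, not the linked answer's code: visitSequentially is a hypothetical helper (not part of PhantomJS) that receives the opener as a parameter, and fakeOpener exists only so the sketch can run on its own.

```javascript
// Visit the urls one at a time: the next open starts only inside the
// previous callback, so a single page instance is enough.
function visitSequentially(urls, open, onPage, onDone) {
    var index = 0;

    function next() {
        if (index >= urls.length) {
            onDone();               // all URLs handled
            return;
        }
        var url = urls[index];
        index += 1;
        open(url, function (status) {
            onPage(url, status);    // e.g. save the content here
            next();                 // only now move on to the next URL
        });
    }

    next();
}

// Stand-in opener for running the sketch outside PhantomJS (an assumption).
var visited = [];
var finished = false;

function fakeOpener(url, callback) {
    callback('success');
}

visitSequentially(
    ['http://example.com/a', 'http://example.com/b'],
    fakeOpener,
    function (url, status) { visited.push(url + ' ' + status); },
    function () { finished = true; }
);

console.log(visited, finished);
```

In PhantomJS, the opener could be function (url, cb) { page.open(url, cb); } for a single page created up front, onPage would write page.content with fs.write(), and onDone would call phantom.exit().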
