Scraping multiple URLs by looping in PhantomJS

Question

I am using PhantomJS to scrape some websites and then extract the information with R. I am following this tutorial. Everything works fine for a single page, but I couldn't find any simple tutorial on how to automate this for multiple pages. My experiments so far:

var countries = [ "Albania" ,"Afghanistan"];
var len = countries.length;
var name1 = ".html";
var add1 = "http://www.kluwerarbitration.com/CommonUI/BITs.aspx?country=";
var country ="";
var name ="";
var add="";


for (i=1; i <= len; i++){

    country = countries[i]
    name = country.concat(name1)
    add = add1.concat(name1)

    var webPage = require('webpage');
    var page = webPage.create();

    var fs = require('fs');
    var path = name

    page.open(add, function (status) {
        var content = page.content;
        fs.write(path,content,'w')
        phantom.exit();
    });

}

I don't seem to get any error when running the code. The script creates an HTML file only for the second country, and that file contains everything on the page except the small table I am interested in.

I tried to gather some information from similar questions. However, since I couldn't find a simple reproducible example, I still don't understand what I am doing wrong.

Recommended answer

The main problem seems to be that you're exiting too early. You're creating multiple page instances in a loop. Since PhantomJS is asynchronous, the call to page.open() returns immediately and the next for loop iteration is executed.

A for-loop is pretty fast, but web requests are slow. This means that your loop is fully executed before even the first page is loaded. It also means that the first page to finish loading will exit PhantomJS, because you're calling phantom.exit() in each of those page.open() callbacks. I suspect the second URL is faster for some reason and is therefore always written.
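The ordering described above can be reproduced without a network. The following is a minimal sketch in plain JavaScript, not PhantomJS code: fakeOpen, pending, and log are hypothetical names, and fakeOpen only imitates the way an asynchronous call such as page.open() defers its callback until after the loop has finished.

```javascript
// fakeOpen is a hypothetical stand-in for page.open(): it only queues the
// callback, the way an asynchronous call defers its callback until later.
var pending = [];
var log = [];

function fakeOpen(url, callback) {
    pending.push(function () { callback('success', url); });
}

var urls = ['http://example.com/a', 'http://example.com/b'];

for (var i = 0; i < urls.length; i++) {
    fakeOpen(urls[i], function (status, url) {
        log.push('loaded ' + url);
    });
}
// The loop is done, but no callback has run yet.
log.push('loop finished');

// Simulate the event loop delivering the responses after the script body.
while (pending.length > 0) {
    pending.shift()();
}

console.log(log.join('\n'));
```

In this simulation 'loop finished' is logged before either 'loaded' entry, so a phantom.exit() placed in the first callback to run would end the process while other requests are still outstanding.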

One fix is to count the finished callbacks and exit only after the last one:

var countFinished = 0,
    maxFinished = len;

// Exit only after every page.open() callback has run.
function checkFinish(){
    countFinished++;
    if (countFinished === maxFinished) {
        phantom.exit();
    }
}

// Load the modules once, outside the loop.
var webPage = require('webpage');
var fs = require('fs');

// Arrays are zero-based, so iterate from 0 to len - 1.
for (var i = 0; i < len; i++) {
    country = countries[i];
    name = country.concat(name1);
    add = add1.concat(country);

    // Note: page and path are function-scoped, so every callback still
    // sees the values from the last iteration (fixed in the next version).
    var page = webPage.create();
    var path = name;

    page.open(add, function (status) {
        var content = page.content;
        fs.write(path, content, 'w');
        checkFinish();
    });
}

The problem is that you're creating a lot of page instances without cleaning up. You should close them when you're done with them:

for (var i = 0; i < len; i++) {
    (function(i){
        // Declared inside the IIFE, so each iteration gets its own copies.
        var country = countries[i];
        var path = country.concat(name1);
        var add = add1.concat(country);
        var page = webPage.create();

        page.open(add, function (status) {
            var content = page.content;
            fs.write(path, content, 'w');
            page.close();   // release the page once its content is saved
            checkFinish();
        });
    })(i);
}

Since JavaScript has function-level scope, you would need to use an IIFE to retain a reference to the correct page instance in the page.open() callback. See this question for more information about that: Q: JavaScript closure inside loops – simple practical example
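The scoping effect can be seen in isolation with a small plain-JavaScript sketch. The names shared, scoped, sharedResult, and scopedResult are hypothetical and exist only for this illustration; the callbacks are called by hand after the loops, which is roughly when a page.open() callback would run.

```javascript
var names = ['Albania', 'Afghanistan'];
var shared = [];
var scoped = [];
var i;

// Without an IIFE: every callback closes over the same `path` variable.
for (i = 0; i < names.length; i++) {
    var path = names[i] + '.html';
    shared.push(function () { return path; });
}

// With an IIFE: each call creates a fresh scope with its own `path`.
for (i = 0; i < names.length; i++) {
    (function (index) {
        var path = names[index] + '.html';
        scoped.push(function () { return path; });
    })(i);
}

// Run the callbacks only after both loops have finished.
var sharedResult = shared.map(function (f) { return f(); });
var scopedResult = scoped.map(function (f) { return f(); });

console.log(sharedResult); // [ 'Afghanistan.html', 'Afghanistan.html' ]
console.log(scopedResult); // [ 'Albania.html', 'Afghanistan.html' ]
```

Without the IIFE both callbacks report the last file name, which is consistent with a script that only ever writes a file for the last country in the list.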

If you don't want to clean up afterwards, then you should use the same page instance over all of those URLs. I already have an answer about doing that here: A: Looping over urls to do the same thing
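The linked answer is not reproduced here, so the following is only one possible sketch of the single-instance idea: open the next URL from inside the previous callback. scrapeSequentially, onPage, onDone, and fakePage are hypothetical names. The fake page exists so the sketch runs outside PhantomJS; inside PhantomJS one would presumably pass require('webpage').create() instead, write the content with fs.write() in onPage, and call phantom.exit() in onDone.

```javascript
// Visit the URLs one after another with a single page object.
// `page` only needs an open(url, callback) method and a content property.
function scrapeSequentially(page, urls, onPage, onDone) {
    var index = 0;

    function next() {
        if (index >= urls.length) {
            onDone();           // all URLs processed
            return;
        }
        var url = urls[index++];
        page.open(url, function (status) {
            onPage(url, status, page.content);
            next();             // start the next URL only after this one is done
        });
    }

    next();
}

// Hypothetical fake page so the sketch can run outside PhantomJS.
var fakePage = {
    content: '',
    open: function (url, callback) {
        this.content = '<html>' + url + '</html>';
        callback('success');
    }
};

var visited = [];
var finished = false;

scrapeSequentially(
    fakePage,
    ['http://example.com/a', 'http://example.com/b'],
    function (url, status, content) { visited.push(url + ' ' + status); },
    function () { finished = true; }
);

console.log(visited, finished);
```

Because only one request is in flight at a time, there is no shared-variable problem and no counter is needed: onDone runs exactly once, after the last URL.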
