Using multiple page.open in one script


Question

My goal is to open many pages (with a short delay between them) and save my data to a file.

But my code doesn't work.

var gamesList = [url1,url2,url3];
//gamesList is getting from a file

var urls = [];
var useragent = [];
useragent.push('Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14');
useragent.push('Opera/9.80 (X11; Linux x86_64; U; fr) Presto/2.9.168 Version/11.50');

var page = require('webpage').create();
page.settings.userAgent = useragent[Math.floor(Math.random() * useragent.length)];
console.log('Loading a web page');


function handle_page(url){
    page.open(url,function(){
        //...
        var html= page.evaluate(function(){
            // ...do stuff...
            page.injectJs('jquery.min.js');
            return $('body').html();
        });
        //save to file
        var file = fs.open('new_test.txt', "w");
        file.write(html + '\n');
        file.close();    

        console.log(html);

        setTimeout(next_page,1000);
    });
}

function next_page(urls){
    var url=urls.shift();
    if(!urls){
        phantom.exit(0);
    }
    handle_page(url);
}

next_page(urls);
phantom.exit();

Does it matter where I call phantom.exit()? If I put it at the end of the page.open() callback, the first page opens fine.

Recommended answer

Your idea of opening multiple pages with recursion is correct, but you have some problems.

As you correctly noted, you have a problem with phantom.exit(). Since page.open() and setTimeout() are asynchronous, you should exit only when you are done. When you call phantom.exit() at the end of the script, you're exiting before the first page is even loaded.

Simply remove that last phantom.exit(), because you already have another exit at the correct place.
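The ordering problem can be seen without loading any page. The following is a minimal sketch in which fakeOpen is a hypothetical stand-in for page.open() (it is not a PhantomJS function) that returns immediately and calls back later through setTimeout():

```javascript
// Records the order in which things happen.
var events = [];

// Hypothetical stand-in for page.open(): returns at once, calls back later.
function fakeOpen(url, callback) {
    events.push('open requested');
    setTimeout(function () {
        events.push('callback ran');
        callback('success');
    }, 0);
}

fakeOpen('http://example.com/', function (status) {
    // In a real script this is where the page would be processed, and the
    // only safe place to decide whether it is time to call phantom.exit().
    events.push('status: ' + status);
});

// Still inside the same synchronous run: the callback has not fired yet.
events.push('end of script');
```

At the point where 'end of script' is recorded, events holds only 'open requested' and 'end of script'. A phantom.exit() placed there would therefore run before any callback had a chance to fire.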

page.evaluate() provides access to the DOM context (page context). The problem is that it is sandboxed. Inside that callback you have no access to variables defined outside. You can explicitly pass variables in, but they have to be primitive (JSON-serializable) values, which page is not. You simply have no access to page inside of page.evaluate(), so you need to inject jQuery before calling page.evaluate().
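To make the sandboxing concrete, here is a rough model of it in plain JavaScript. modelEvaluate is a hypothetical helper written for this illustration, not PhantomJS's actual implementation. It assumes only that the function travels as source text and the arguments as JSON, which is why closures do not survive the trip:

```javascript
// Rough model of page.evaluate(): the function becomes source text and the
// arguments become JSON, so nothing from the calling scope travels along.
function modelEvaluate(fn /*, args... */) {
    var args = Array.prototype.slice.call(arguments, 1);
    var source = '(' + fn.toString() + ').apply(null, ' + JSON.stringify(args) + ')';
    // new Function() builds the function in the global scope,
    // not in the scope of whoever called modelEvaluate().
    return new Function('return ' + source + ';')();
}

function demo() {
    var selector = 'body';   // local variable, like one defined in the outer script

    var viaClosure;
    try {
        // Relies on the outer variable: it is not visible on the other side.
        viaClosure = modelEvaluate(function () { return selector; });
    } catch (e) {
        viaClosure = e.name;
    }

    // Passes the value explicitly as an argument: this works.
    var viaArgument = modelEvaluate(function (sel) { return sel; }, selector);

    return { viaClosure: viaClosure, viaArgument: viaArgument };
}

var result = demo();
```

Here result.viaClosure ends up as 'ReferenceError' and result.viaArgument as 'body'. In PhantomJS itself the corresponding form is page.evaluate(function (sel) { ... }, selector), with the extra arguments limited to values that can be serialized as JSON.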

You're overwriting the file in every iteration by not changing the file name. Either you need to change the filename or use the appending mode 'a' instead of 'w'.

Then you don't need to open a stream when you simply want to write once. Change:

var file = fs.open('new_test.txt', "w");
file.write(html + '\n');
file.close();

to:

fs.write('new_test.txt', html + '\n', 'a');
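If you would rather change the file name than append, one option is to derive a distinct name per page. The sketch below uses a hypothetical naming scheme (page_001.txt, page_002.txt, and so on); the function name and the pattern are assumptions made for illustration only:

```javascript
// Hypothetical naming scheme: one output file per page, numbered by position.
function outputNameFor(index) {
    var n = String(index + 1);
    while (n.length < 3) {
        n = '0' + n;          // left-pad with zeros so the files sort in visit order
    }
    return 'page_' + n + '.txt';
}

// Names for the 1st, 2nd and 12th page in the list.
var names = [0, 1, 11].map(outputNameFor);
```

With a counter that is incremented once per page, the result of outputNameFor(counter) could be passed to fs.write() in place of the fixed 'new_test.txt'. Each file is then written once, so the 'w' mode would be sufficient.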



Recursive step

The recursive step with calling the next_page() function requires that you pass in the urls. Since urls is already a global variable and you change it in each iteration, you don't need to pass in the urls.

You also don't need the setTimeout(), because everything inside the page.open() callback up to that point is synchronous.

//...
var urls = [/*....*/];

function handle_page(url){
    page.open(url, function(){
        //...
        page.injectJs('jquery.min.js');
        var html = page.evaluate(function(){
            // ...do stuff...
            return $('body').html();
        });
        //save to file
        fs.write('new_test.txt', html + '\n', 'a');

        console.log(html);

        next_page();
    });
}

function next_page(){
    var url = urls.shift();
    if(!url){
        phantom.exit(0);
        return; // stop here so handle_page() is not called with undefined
    }
    handle_page(url);
}

next_page();
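The same recursion can also be written without relying on global variables. This is only a sketch: visitAll and its parameters are hypothetical names, and the demo uses a synchronous stand-in for page.open() so that it runs outside PhantomJS as well:

```javascript
// Visits each URL in order. `open` is any function with the shape
// open(url, callback); `done` runs exactly once, after the last URL.
function visitAll(urls, open, handle, done) {
    var queue = urls.slice();   // copy, so the caller's array is left untouched

    function next() {
        if (queue.length === 0) {
            done();
            return;             // nothing left: do not fall through to open()
        }
        var url = queue.shift();
        open(url, function (status) {
            handle(url, status);
            next();             // move on only after this page has been handled
        });
    }

    next();
}

// Demo with a stand-in for page.open() that reports success immediately.
var visited = [];
var finished = false;
visitAll(
    ['url1', 'url2', 'url3'],
    function (url, callback) { callback('success'); },
    function (url, status) { visited.push(url + ':' + status); },
    function () { finished = true; }
);
```

In a PhantomJS script, `open` could be function (url, cb) { page.open(url, cb); }, `handle` would hold the injectJs/evaluate/fs.write steps, and `done` would be function () { phantom.exit(0); }. If the short delay from the question is still wanted, the next() call inside the callback could be wrapped in setTimeout().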
