Can I get the original page source (vs current DOM) with phantomjs/casperjs?


Question

I am trying to get the original source for a particular web page.

The page executes some scripts that modify the DOM as soon as it loads. I would like to get the source before any script or user changes any object in the document.

With Chrome or Firefox (and probably most browsers) I can either look at the DOM (debug utility F12) or look at the original source (right-click, view source). The latter is what I want to accomplish.

Is it possible to do this with phantomjs/casperjs?

Before getting to the page I have to log in. This is working fine with casperjs. If I browse to the page and render the results I know I am on the right page.

casper.thenOpen('http://'+customUrl, function(response) {
    this.page.render('example.png'); // *** Renders correct page (current DOM) ***
    console.log(this.page.content); // *** Gets current DOM ***
    casper.download('view-source:'+customUrl, 'b.html', 'GET'); // *** Blank page ***
    console.log(this.getHTML()); // *** Gets current DOM ***
    this.debugPage(); // *** Gets current DOM ***
    utils.dump(response); // *** No BODY ***
    casper.download('http://'+customUrl, 'a.html', 'GET');  // *** Not logged in ?! ***
});

I've tried this.download(url, 'a.html'), but it doesn't seem to share the same context: it returns HTML as if I were not logged in, even when I run with a cookies file (casperjs test.casper.js --cookies-file=cookies.txt).

I believe I should keep analyzing this option.

I have also tried casper.open('view-source:url') instead of casper.open('http://url'), but it doesn't seem to recognize the URL, since I just get a blank page.

I have looked at the raw HTTP response from the server with a separate utility, and the body of that message (which is HTML) is what I need. But by the time the page has loaded in the browser, the DOM has already been modified.

I tried:

casper.thenOpen('http://'+url, function(response) {
    ...
});

But the response object only contains the headers and some other information, not the body.

I also tried the onResourceRequested event.

The idea is to abort the download of any resource needed by a specific web page (the referer).

onResourceRequested: function(casperObj, requestData, networkRequest) {
    // Abort any request whose Referer header is the target page.
    for (var i = 0; i < requestData.headers.length; i++) {
        var obj = requestData.headers[i];
        if (obj.name === "Referer" && obj.value === 'http://'+customUrl) {
            networkRequest.abort();
            break;
        }
    }
}

Unfortunately, the script that initially modifies the DOM seems to be inline in the main HTML page (or this code is not doing what I would like it to do).
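
The header check in the handler above can be pulled out into a small standalone function, which makes it easier to verify on its own. This is only a sketch: `hasReferer` is a hypothetical name, not CasperJS API, and the wiring at the bottom assumes the `casper` instance and the `customUrl` variable from the script. The header name is compared case-insensitively, since HTTP header names are case-insensitive. It cannot help when the DOM-modifying script is inline, because no separate request is made for it.

```javascript
// Hypothetical helper (not part of CasperJS): returns true when the request
// headers contain a Referer header whose value equals pageUrl.
function hasReferer(headers, pageUrl) {
    for (var i = 0; i < headers.length; i++) {
        var header = headers[i];
        if (String(header.name).toLowerCase() === 'referer' && header.value === pageUrl) {
            return true;
        }
    }
    return false;
}

// Wiring sketch; runs only inside a CasperJS script where `casper` and
// `customUrl` are defined, and is skipped everywhere else.
if (typeof casper !== 'undefined' && typeof customUrl !== 'undefined') {
    casper.on('resource.requested', function (requestData, networkRequest) {
        if (hasReferer(requestData.headers, 'http://' + customUrl)) {
            networkRequest.abort();
        }
    });
}
```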

Any ideas?

Here is the complete code:

phantom.casperTest = true;
phantom.cookiesEnabled = true;

var utils = require('utils');
var casper = require('casper').create({
    clientScripts:  [],
    pageSettings: {
        loadImages:  false,
        loadPlugins: false,
        javascriptEnabled: true,
        webSecurityEnabled: false
    },
    logLevel: "error",
    verbose: true
});

casper.userAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X)');

casper.start('http://www.xxxxxxx.xxx/login');

casper.waitForSelector('input#login',
    function() {
        this.evaluate(function(customLogin, customPassword) {
            document.getElementById("login").value = customLogin;
            document.getElementById("password").value = customPassword;
            document.getElementById("button").click();
        }, {
            "customLogin": customLogin,
            "customPassword": customPassword
        });
    },
    function() {
        console.log("Can't login.");
    },
    15000
);

casper.waitForSelector('div#home',
    function() {
        console.log('Login successful.');
    },
    function() {
        console.log('Login failed.');
    },
    15000
);

casper.thenOpen('http://'+customUrl, function(response) {
    this.page.render('example.png'); // *** Renders correct page (current DOM) ***
    console.log(this.page.content); // *** Gets current DOM ***
    casper.download('view-source:'+customUrl, 'b.html', 'GET'); // *** Blank page ***
    console.log(this.getHTML()); // *** Gets current DOM ***
    this.debugPage(); // *** Gets current DOM ***
    utils.dump(response); // *** No BODY ***
    casper.download('http://'+customUrl, 'a.html', 'GET');  // *** Not logged in ?! ***
});


Recommended answer

Hmm, did you try using some events? For example:

casper.on('load.started', function(resource) {
    casper.echo(casper.getPageContent());
});

I don't think it will work, but try it anyway.

The problem is that you can't do it in a normal CasperJS step, because the scripts on your page have already executed by then. It could work if we could bind to the DOM-ready event, or if Casper had a specific event like that. But the page must be loaded before Casper can send any JS to the DOM environment, so binding onready isn't possible (I don't see how). I think with Phantom we can only scrape data after the load event, that is, only once the page is rendered.
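
One way to live with the "page must already be loaded" constraint, offered as an untested sketch rather than something from the original answer: once the page has loaded, request the same URL again from inside the page context with a synchronous XMLHttpRequest. Because the page itself makes the request, it should carry the session cookies, and `responseText` is the body as the server sent it, before any script touched it. The names `fetchRawSource` and `original.html` are hypothetical, and the sketch assumes the server returns the same document for a repeated GET.

```javascript
// Hypothetical helper, meant to run inside the page (e.g. via casper.evaluate).
// The request is synchronous so the function can simply return the body.
function fetchRawSource(url) {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', url, false); // third argument false = synchronous
    xhr.send(null);
    return xhr.responseText;
}

// Wiring sketch; runs only inside a CasperJS script where `casper` exists.
if (typeof casper !== 'undefined') {
    casper.then(function () {
        // evaluate() runs fetchRawSource in the page context and returns its result.
        var raw = this.evaluate(fetchRawSource, this.getCurrentUrl());
        require('fs').write('original.html', raw, 'w'); // PhantomJS fs module
    });
}
```

The helper takes the URL as an argument and uses no outer variables, since a function passed to `evaluate` is run in the page and cannot see the script's scope.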

So if it's not possible to hack it with events and maybe some delay, your only solution is to block the scripts that modify your DOM.

There is still the PhantomJS option, which you can set in Casper:

casper.pageSettings.javascriptEnabled = false;

The problem is that you need JS enabled to get the data back, so it can't work... :p Yeah, useless comment! :)

Otherwise you have to use events to block the specific resource/script that modifies the DOM.
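
As a rough illustration of blocking by URL, the decision can be kept in a plain function and called from a `resource.requested` handler. The `shouldBlock` name and the pattern list are made up for this sketch, not taken from the question, and the wiring assumes a `casper` instance. This only affects external resources; an inline script in the main HTML cannot be blocked this way.

```javascript
// Hypothetical helper: true when url matches any of the given RegExp patterns.
function shouldBlock(url, patterns) {
    for (var i = 0; i < patterns.length; i++) {
        if (patterns[i].test(url)) {
            return true;
        }
    }
    return false;
}

// Example patterns (an assumption): any .js file, with or without a query string.
var blockedPatterns = [/\.js(\?.*)?$/i];

// Wiring sketch; runs only inside a CasperJS script where `casper` exists.
if (typeof casper !== 'undefined') {
    casper.on('resource.requested', function (requestData, networkRequest) {
        if (shouldBlock(requestData.url, blockedPatterns)) {
            networkRequest.abort();
        }
    });
}
```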

Or you could use the resource.received event to scrape the data you want before the specific resources that modify the DOM appear.

In fact, I don't think it's possible: if you create a step that gets some data back from the page just before specific resources appear, then by the time your step executes, those resources will have loaded. You would need to freeze the following resources while your step is scraping the data.

I don't know how to do that, but these events could help you:

casper.on('resource.requested', function(request) {
    console.log(" request " + request.url);
});

casper.on('resource.received', function(resource) {
    console.log(resource.url);
});

casper.on('resource.error',function (request) {
    this.echo('[res : id and url + error description] <-- ' + request.id + ' ' + request.url + ' ' + request.errorString);
});

See also How do you Disable css in CasperJS?. The solution that would work: identify the scripts and block them. But if you need them, well, I don't know; it's a good question. Maybe we could defer the execution of a specific script, but I don't think Casper and Phantom easily permit that. The only useful option is abort(); give us this option: timeout("time -> ms")!

onResourceRequested

Here is a similar question: Injecting script before other
