PhantomJS: change webpage content before evaluating


Question

I'd like to either remove an HTML element or simply remove first N characters of a webpage before evaluating/rendering it.

Is there a way to do this?

Recommended answer

It depends on multiple scenarios. I will only outline the steps for each combination of the answers to the following questions.


  1. Is the piece of JS called onload (ol) or is the script block immediately evaluated (ie)?
  2. Is it an inline script (is) or is the script loaded separately (src attribute) (ls)?
  3. Does the script block also contain some code that should not be removed (nr) or can it be removed completely (rc)?



1. Script is loaded separately (ls) & code can be removed completely (rc)

Register to the onResourceRequested listener and request.abort() depending on the matched url.

This can only be done when the following code blocks do not depend on the code that should not be removed (which is unlikely). This is most likely necessary for click events that are registered in the DOM.
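The request-blocking approach from case 1 can be sketched roughly as follows. The URL pattern, the page URL and the output file name are hypothetical placeholders, and the helper `shouldAbort` is introduced here only for illustration. The PhantomJS part is guarded so the helper can also be exercised outside PhantomJS.

```javascript
// Returns true when the requested URL matches any of the given patterns.
// Patterns must not use the "g" flag, because RegExp.prototype.test is
// stateful for global regular expressions.
function shouldAbort(url, patterns) {
    for (var i = 0; i < patterns.length; i++) {
        if (patterns[i].test(url)) {
            return true;
        }
    }
    return false;
}

// Hypothetical pattern for the script that should never be loaded.
var blockedPatterns = [/\/unwanted\.js(\?|$)/];

if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();

    // Called for every resource the page requests.
    page.onResourceRequested = function (requestData, networkRequest) {
        if (shouldAbort(requestData.url, blockedPatterns)) {
            networkRequest.abort();
        }
    };

    // Placeholder URL and file name.
    page.open('http://www.example.com', function (status) {
        if (status === 'success') {
            page.render('without_script.png');
        }
        phantom.exit();
    });
}
```

Matching on a regular expression rather than an exact string keeps the check tolerant of cache-busting query strings.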

In this case cancel the request like in 1., download the script through an XHR, remove the unwanted code parts and add code block to the DOM. For this to work, you would need to disable web security, because otherwise no resource can be requested if it is not on the same domain: --web-security=false.
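A minimal sketch of these steps, under several assumptions: the script URL, the page URL and the pattern for the unwanted code are hypothetical, and the patched script is injected only after the page has finished loading, which may be too late for some pages. The `allowDownload` flag exists because `onResourceRequested` also fires for the XHR itself, which would otherwise be aborted as well.

```javascript
// Removes the first occurrence of the unwanted code from the script source.
function stripCode(source, pattern) {
    return source.replace(pattern, '');
}

if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var scriptUrl = 'http://www.example.com/js/app.js'; // hypothetical
    var unwanted = /track\(\);\s*/;                     // hypothetical
    var allowDownload = false;

    // Abort the page's own request for the script, but let our XHR through.
    page.onResourceRequested = function (requestData, networkRequest) {
        if (requestData.url === scriptUrl && !allowDownload) {
            networkRequest.abort();
        }
    };

    page.open('http://www.example.com', function (status) {
        if (status !== 'success') {
            phantom.exit(1);
            return;
        }
        allowDownload = true;

        // Download the script source from inside the page context.
        var source = page.evaluate(function (url) {
            var xhr = new XMLHttpRequest();
            xhr.open('GET', url, false);
            xhr.send();
            return xhr.responseText;
        }, scriptUrl);

        // Add the patched code to the DOM as an inline script.
        page.evaluate(function (code) {
            var script = document.createElement('script');
            script.text = code;
            document.head.appendChild(script);
        }, stripCode(source, unwanted));

        phantom.exit();
    });
}
```

If the script lives on another domain, this would be run as `phantomjs --web-security=false script.js`, as noted above.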

This is probably very error prone. You would begin an Interval with setInterval(function(){}, 5) from a page.onInitialized callback. Inside the interval you would need to check if window.onload (or something else you can get your hands on) is set in the page context. You remove it, if it is indeed the function that you wanted to remove by checking window.onload.toString().match(/something/).

This can be done directly and completely inside the page context (inside page.evaluate).
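The polling idea could look something like the sketch below. The helper name `removeOnloadIfMatches`, the pattern `something` and the URL are placeholders. Because `page.evaluate` serializes the function it is given, the helper is passed into the page as source text and rebuilt there; that detail is an assumption of this sketch, not something the answer prescribes.

```javascript
// Clears win.onload when its source text matches the pattern.
// Returns true if the handler was removed.
function removeOnloadIfMatches(win, patternSource) {
    var handler = win.onload;
    if (typeof handler === 'function' &&
            new RegExp(patternSource).test(handler.toString())) {
        win.onload = null;
        return true;
    }
    return false;
}

if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();

    // onInitialized fires after the page object is created but before
    // the page's own scripts have run.
    page.onInitialized = function () {
        page.evaluate(function (helperSource, patternSource) {
            // Rebuild the helper inside the page context.
            var helper = eval('(' + helperSource + ')');
            var timer = setInterval(function () {
                if (helper(window, patternSource)) {
                    clearInterval(timer);
                }
            }, 5);
        }, removeOnloadIfMatches.toString(), 'something');
    };

    page.open('http://www.example.com', function () {
        page.render('without_onload.png');
        phantom.exit();
    });
}
```

The fragility mentioned above shows up here: if the load event fires between two ticks of the interval, the handler runs before it can be removed.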

Begin like in 3., but instead of removing window.onload, you can do

eval("window.onload = " + window.onload.toString().replace(/something/,''))
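As a slightly more defensive variant of that one-liner, the rewrite can be wrapped in a helper that checks the handler first. `rewriteOnload` is a name made up for this sketch, and it takes the window object as a parameter so it can be tried with a stand-in object. Note that the rebuilt function loses any closure variables the original handler relied on.

```javascript
// Rebuilds win.onload from its own source text with the matching part
// removed. Returns true if the handler was rewritten.
function rewriteOnload(win, patternSource) {
    var handler = win.onload;
    if (typeof handler !== 'function') {
        return false;
    }
    var source = handler.toString();
    var pattern = new RegExp(patternSource);
    if (!pattern.test(source)) {
        return false;
    }
    // Wrapping in parentheses makes eval treat the text as a function
    // expression instead of a declaration.
    win.onload = eval('(' + source.replace(pattern, '') + ')');
    return true;
}

// Usage with a stand-in for window:
var fakeWindow = { onload: function () { return 'keep' + 'drop'; } };
rewriteOnload(fakeWindow, " \\+ 'drop'");
console.log(fakeWindow.onload()); // prints "keep"
```

Inside PhantomJS this would be called with the real `window` from the page context, for example from the polling interval described for case 3.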



5. Script is loaded with the DOM (is) & the script block immediately evaluated (ie)

You can load the page as an XHR, replace the text and apply the adjusted content to the page. This will essentially be a filled about:blank page. For this to work, you would need to disable web security, because otherwise no resource can be requested if it is not on the same domain: --web-security=false or --local-to-remote-url-access=true. This would also work for cases 3 and 4.

There is still one problem though. Pages don't use full URLs most of the time. So when a script or element refers to stuff.php PhantomJS cannot request it. When the page.content is set then the page URL is essentially about:blank and all requests with incomplete URLs point to file:///.... Obviously there are no such files. Those resources must be replaced with their full URL counterparts.
There are three types of such URLs:


  • //example.com/resource.php variable protocol
  • /resource.php variable protocol and domain
  • resource.php variable protocol, domain and path to resource
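One way to resolve these three kinds of references against the original page URL is sketched below. Both function names are invented for this example. The resolver is deliberately simple: it does not normalize `./` or `../` segments, and the attribute rewriting uses a naive regular expression rather than a real HTML parser.

```javascript
// Resolves a reference found in the page against the page's original URL.
function toAbsoluteUrl(ref, pageUrl) {
    var m = /^([a-z][a-z0-9+.\-]*:)\/\/([^\/?#]+)([^?#]*)/i.exec(pageUrl);
    if (!m) {
        throw new Error('pageUrl must be absolute: ' + pageUrl);
    }
    var protocol = m[1], host = m[2], path = m[3] || '/';

    if (/^[a-z][a-z0-9+.\-]*:/i.test(ref)) {
        return ref;                               // already a full URL
    }
    if (ref.indexOf('//') === 0) {
        return protocol + ref;                    // variable protocol
    }
    if (ref.charAt(0) === '/') {
        return protocol + '//' + host + ref;      // variable protocol and domain
    }
    // Variable protocol, domain and path: resolve against the page directory.
    var dir = path.substring(0, path.lastIndexOf('/') + 1);
    return protocol + '//' + host + dir + ref;
}

// Rewrites quoted src and href attribute values in an HTML string.
function absolutizeAttributes(html, pageUrl) {
    return html.replace(/\b(src|href)=(["'])(.*?)\2/gi,
        function (match, attr, quote, value) {
            return attr + '=' + quote + toAbsoluteUrl(value, pageUrl) + quote;
        });
}
```

The adjusted markup could then be assigned to `page.content` in place of the raw response text.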

Full example:

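var page = require('webpage').create(),
    url = 'http://www.example.com';

page.open(url, function(status) {
    if (status !== 'success') {
        console.log('Unable to access network');
        phantom.exit(1); // exit on failure too, otherwise the script hangs
    } else {
        // Fetch the page source again with a synchronous XHR from the page context.
        var content = page.evaluate(function(url){
            var xhr = new XMLHttpRequest();
            xhr.open("GET", url, false);
            xhr.send();
            return xhr.responseText;
        }, url);
        page.render("test_example.png"); // the original page
        // Assigning page.content makes PhantomJS parse the adjusted markup.
        page.content = content.replace(/xample/g,"asy");
        page.render("test_easy.png"); // the modified page
        console.log("url "+page.url); // about:blank
        phantom.exit();
    }
});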

You might want to look into proper manipulation techniques apart from the simple string replace.
