Get HTML with current styles (maybe inlined) of a page that finished rendering and finished running scripts


Question


I need to get the HTML with current styles (maybe inlined) of a page that has finished rendering and finished running scripts, using a server-side application which will be given just a URL (no extra information such as cookies, no POSTs, no impeding forms, etc.).

A bridge/proxy to a temporarily running browser, or a standalone utility using a browser library, is an acceptable solution (however, the chosen browser or browser library must be available on all major platforms, and must be able to run without an OS GUI being present or installed).

An optional requirement is to remove all scripts afterwards (there are already standalone solutions for this; I am adding it here because the given answer may be able to remove scripts while rendering or something like that).

How do I get a snapshot in HTML+CSS, in a single .html file, of the current HTML document with the current styles (maybe inlined) and current images (using data URIs)?

If it can be done using pure PHP it would be a plus (although I doubt it, I haven't found anything interesting).

Edit: I know how to load HTTP resources and get the HTML for a URL, that's not what I'm looking for ;)

Edit 2: Example input HTML:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
    <head>
        <title></title>

        <meta http-equiv="Content-Type" content="text/html;charset=utf-8">

        <link rel="stylesheet" type="text/css" href="/css/example.css">
        <script type="text/javascript" src="/javascript/example.js"></script>

        <script type="text/javascript">
            window.addEventListener("load",
                function(event){
                    document.title="New title";

                    document.getElementById("pic_0").style.border="0px";
                }
            );
        </script>
        <style type="text/css">
            p{
                color: blue;
            }
        </style>
    </head>
    <body>
        <p>Hello world!</p>
        <p>
            <img 
                alt="" 
                style="border: 1px" 
                id="pic_0" 
                src="http://linuxgazette.net/144/misc/john/helloworld.png"
            >
        </p>
    </body>
</html>

Example output:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
    <head>
        <title>New title</title>

        <meta http-equiv="Content-Type" content="text/html;charset=utf-8">

        <style type="text/css">
            b{font-weight: bold}
        </style>

        <style type="text/css">
            p{
                color: blue;
            }
        </style>
    </head>
    <body>
        <p>Hello world!</p>
        <p>
            <img 
                alt="" 
                style="border: 0px" 
                id="pic_0" 
                src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACgAAAAoBAMAAAB+0KVeAAAAK3RFWHRDcmVhdGlvbiBUaW1lAFYgMzEgYXVnLiAyMDEyIDE3OjU4OjU1ICswMjAwWMdbPwAAAAd0SU1FB9wIHw8ABeoUyU4AAAAJcEhZcwAACxIAAAsSAdLdfvwAAAAEZ0FNQQAAsY8L/GEFAAAABlBMVEX///8AAABVwtN+AAAAXklEQVR42uWQUQ6AMAhD6Q3a+19WqsawwMf+NLEfy3iDlC7idTGQp/YglFAsUMqSwjlQOhN3mIMTHDq70SeEWBbt0EG8POWkDySvmCh/SssvNfwIfb+hFmgjFKPf6gDQBAQ368m09AAAAABJRU5ErkJggg=="
            >
        </p>
    </body>
</html>

Notice how the <title> tag changed, how border: 1px became border: 0px, how the image URL was transformed into a data URI.

For example, some of these transformations (inline CSS and <title> tag) can be observed when inspecting the document using the Google Chrome inspector.

Edit 3: Replacing external resources with on-page ones (styles and images) and removing JavaScript is the easy part. The hard part is computing the CSS style after running JavaScript.

Edit 4: Maybe this could be done using injected JavaScript (still need browser control though)?

Solution

PhantomJS is a headless (GUI-less) WebKit with a JavaScript API. It runs on all major platforms, as I requested in my question.

It can run JavaScript scripts to control the GUI-less web browser. It has a powerful API, and lots and lots of examples.

In my spare time over the last 2-3 days I wrote the solution to my question, and it covers all requirements beautifully. I haven't found a webpage for which it wouldn't work.

Usage, command line:

phantomjs save_as_html.js http://stackoverflow.com/q/12215844/584490 saved.html

JavaScript is allowed to run for n seconds after everything else loads (the script below waits 1000 ms), so it should work even for web pages generated entirely by JavaScript.
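
The script below hard-codes that delay. A minimal sketch of making it configurable follows, assuming a hypothetical optional third command-line argument that gives the delay in seconds. The name parseDelayMs and the extra argument are illustrative only and not part of the original script, whose usage check (system.args.length!=3) would also have to be relaxed to accept it.

```javascript
// Hypothetical helper, not part of the original script: reads an optional
// delay (in seconds) from the 4th entry of the argument list, i.e.
//   phantomjs save_as_html.js URL filename [seconds]
// and returns it in milliseconds, falling back to defaultMs when the
// argument is missing, not a number, or negative.
function parseDelayMs(args, defaultMs) {
    var seconds = parseFloat(args[3]);

    if (!isFinite(seconds) || seconds < 0) {
        return defaultMs;
    }

    return Math.round(seconds * 1000);
}

// Example: no extra argument keeps the 1000 ms default used by the script.
var delayMs = parseDelayMs(["save_as_html.js", "http://example.com/", "saved.html"], 1000);
```

The returned value could then be passed as the second argument of window.setTimeout in place of the literal 1000.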

Notes:

  • Where possible, XHR loading of resources is preferred over HTML5's canvas rendering because of reduced file size and preventing quality loss (reusing original files is better than anything).

  • <link> and <img> tags are kept in place, and data: URIs are used inside the href and src attributes respectively, instead of URLs. The same is true for background-image, which is read using getComputedStyle() on all DOM nodes.

  • <script> tags and event handler attributes are removed.

  • <link> tags with rel="alternate" are also removed (maybe they shouldn't be, and should instead be fixed into an absolute URL, if relative).

  • <iframe> is currently not handled, and its src attribute is being set to about:blank.
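
The data: URI handling described in the notes can be sketched as two small pure functions. This is only an illustration under stated assumptions: the names binaryStringToBase64, toDataUri and extractCssUrl are hypothetical, btoa (or Node.js's Buffer as a fallback) is assumed to be available, and the MIME filter here also keeps "+" and "." so that types such as image/svg+xml are not mangled.

```javascript
// Hypothetical helpers; the names are illustrative, not from the script below.

// Base64-encode a "binary string" as returned by an XHR that used
// overrideMimeType("text/plain; charset=x-user-defined"): only the low
// byte of each character code carries data, so mask it before encoding.
function binaryStringToBase64(str) {
    var bytes = "";
    for (var i = 0; i < str.length; i++) {
        bytes += String.fromCharCode(str.charCodeAt(i) & 0xff);
    }
    // Assumption: btoa exists (browsers, recent Node.js); otherwise use Buffer.
    return typeof btoa === "function"
        ? btoa(bytes)
        : Buffer.from(bytes, "latin1").toString("base64");
}

// Build a data: URI from a Content-Type header value and a binary string.
function toDataUri(contentTypeHeader, binaryString) {
    // Drop parameters such as "; charset=utf-8" and any unexpected characters.
    var mime = contentTypeHeader.split(";")[0].replace(/[^a-z0-9\/+.-]/gi, "");
    return "data:" + mime + ";base64," + binaryStringToBase64(binaryString);
}

// Extract the URL from a computed background-image value such as
// url(http://host/a.png) or url("http://host/a.png").
// Returns null for "none" and for multi-layer values.
function extractCssUrl(value) {
    var match = /^\s*url\(\s*(["']?)([^"')]*)\1\s*\)\s*$/.exec(value);
    return match ? match[2] : null;
}
```

A caller would skip the XHR when extractCssUrl() returns null or a value that already starts with "data:".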

Beware: all cross-site scripting security restrictions are lifted so that all resources can be loaded. Make sure you don't save malicious web pages while using secret credentials, such as those of your Facebook account :).

save_as_html.js contents:

//http://stackoverflow.com/a/12256190/584490

var page = require('webpage').create();
page.onConsoleMessage = function (msg) { console.log(msg); };

var system = require('system');
var address, output, size;


if (system.args.length!=3)
{
    console.log('Usage: save_as_html.js URL filename');
    phantom.exit(1);
}
else
{
    address = system.args[1];
    output = system.args[2];

    page.viewportSize = {    
        width: 1680, 
        height: 1050,
    };

    //SECURITY_ERR: DOM Exception 18: An attempt was made to break through the security policy of the user agent.
    //Enable cross site scripting:
    page.settings.XSSAuditingEnabled=false;
    page.settings.localToRemoteUrlAccessEnabled=true;
    page.settings.webSecurityEnabled=false;

    page.settings.userAgent="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1";
    page.settings.ignoreSslErrors=true;

    page.open(address, function (status){
        if (status!=='success')
        {
            console.log("Unable to load URL, returned status: "+status);
            phantom.exit(1);
        }
        else
        {
            window.setTimeout(function (){
                page.evaluate(function(){
                    var nodeList=document.getElementsByTagName("*");

                    var arrEventHandlerAttributes=[
                        "onblur", "onchange", "onclick", "ondblclick", "onfocus", "onkeydown", "onkeyup", "onkeypress", "onload",
                        "onmousedown", "onmousemove", "onmouseout", "onmouseover", "onmouseup", "onreset", "onselect", "onsubmit", "onunload"
                    ];


                    //http://stackoverflow.com/a/7372816/584490
                    var base64Encode=function(str)
                    {
                        var CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
                        var out = "", i = 0, len = str.length, c1, c2, c3;
                        while (i < len) {
                            c1 = str.charCodeAt(i++) & 0xff;
                            if (i == len) {
                                out += CHARS.charAt(c1 >> 2);
                                out += CHARS.charAt((c1 & 0x3) << 4);
                                out += "==";
                                break;
                            }
                            c2 = str.charCodeAt(i++);
                            if (i == len) {
                                out += CHARS.charAt(c1 >> 2);
                                out += CHARS.charAt(((c1 & 0x3) << 4) | ((c2 & 0xF0) >> 4));
                                out += CHARS.charAt((c2 & 0xF) << 2);
                                out += "=";
                                break;
                            }
                            c3 = str.charCodeAt(i++);
                            out += CHARS.charAt(c1 >> 2);
                            out += CHARS.charAt(((c1 & 0x3) << 4) | ((c2 & 0xF0) >> 4));
                            out += CHARS.charAt(((c2 & 0xF) << 2) | ((c3 & 0xC0) >> 6));
                            out += CHARS.charAt(c3 & 0x3F);
                        }
                        return out;
                    };


                    //Iterate backwards: nodeList is live and nodes are removed along the way.
                    for(var n=nodeList.length-1; n>=0; n--)
                    {
                        try
                        {
                            var el=nodeList[n];

                            if(el.nodeName=="IMG" && el.src.substr(0, 5)!="data:")
                            {
                                /*var canvas=document.createElement("canvas");

                                canvas.width=parseInt(el.width);
                                canvas.height=parseInt(el.height);

                                var ctx=canvas.getContext("2d");
                                ctx.drawImage(el, 0, 0);
                                el.src=canvas.toDataURL();*/

                                var xhr=new XMLHttpRequest();

                                xhr.open(
                                    "get",
                                    el.src,
                                    /*Asynchronous*/ false
                                );

                                xhr.overrideMimeType("text/plain; charset=x-user-defined");

                                xhr.send(null);

                                //Keep "+" and "." so that types such as image/svg+xml are preserved.
                                var strResponseContentType=xhr.getResponseHeader("Content-type").split(";")[0].replace(/[^a-z0-9\/+.-]/gi, "");
                                el.src="data:"+strResponseContentType+";base64,"+base64Encode(xhr.responseText);
                            }
                            else if(el.nodeName=="LINK")
                            {
                                if(el.rel=="alternate")
                                {
                                    el.parentNode.removeChild(el);
                                }
                                else if(el.href.substr(0, 5)!="data:")
                                {
                                    var xhr=new XMLHttpRequest();

                                    xhr.open(
                                        "get",
                                        el.href,
                                        /*Asynchronous*/ false
                                    );

                                    xhr.overrideMimeType("text/plain; charset=x-user-defined");

                                    xhr.send(null);

                                    //var strResponseContentType=xhr.getResponseHeader("Content-type").split(";")[0].replace(/[^a-z0-9\/-]/gi, "");
                                    //el.href="data:"+strResponseContentType+";base64,"+base64Encode(xhr.responseText);
                                    el.href="data:"+el.type+";base64,"+base64Encode(xhr.responseText);
                                }

                                continue;
                            }
                            else if(el.nodeName=="SCRIPT")
                            {
                                el.parentNode.removeChild(el);

                                continue;
                            }
                            else if(el.nodeName=="IFRAME")
                            {
                                el.src="about:blank";

                                continue;
                            }

                            for(var z=arrEventHandlerAttributes.length-1; z>=0; z--)
                                el.removeAttribute(arrEventHandlerAttributes[z]);

                            //Strip all whitespace (a regex literal is needed here, not a string).
                            var strBackgroundImageURL=window.getComputedStyle(el).getPropertyValue("background-image").replace(/\s/g, "");
                            if(strBackgroundImageURL.substr(0, 4)=="url(" && strBackgroundImageURL.substr(4, 5)!="data:")
                            {
                                strBackgroundImageURL=strBackgroundImageURL.substr(4, strBackgroundImageURL.length-5);

                                /*var imageTemp=document.createElement("img");
                                imageTemp.src=strBackgroundImageURL;

                                imageTemp.onload=function(e){
                                    var canvas=document.createElement("canvas");

                                    canvas.width=parseInt(imageTemp.width);
                                    canvas.height=parseInt(imageTemp.height);

                                    var ctx=canvas.getContext("2d");
                                    ctx.drawImage(imageTemp, 0, 0);
                                    el.style.backgroundImage="url("+canvas.toDataURL()+")";
                                };

                                if (imageTemp.complete)
                                    imageTemp.onload();
                                */

                                var xhr=new XMLHttpRequest();

                                xhr.open(
                                    "get",
                                    strBackgroundImageURL,
                                    /*Asynchronous*/ false
                                );

                                xhr.overrideMimeType("text/plain; charset=x-user-defined");

                                xhr.send(null);

                                //Keep "+" and "." so that types such as image/svg+xml are preserved.
                                var strResponseContentType=xhr.getResponseHeader("Content-type").split(";")[0].replace(/[^a-z0-9\/+.-]/gi, "");
                                el.style.backgroundImage="url("+"data:"+strResponseContentType+";base64,"+base64Encode(xhr.responseText)+")";
                            }

                            if(el.nodeName=="A")
                            {
                                el.href="#";//TODO convert relative paths to absolute ones (keep URLs);
                                el.setAttribute("onclick", "return false;");//TODO: remove this when the above is fixed.
                            }
                            else if(el.nodeName=="FORM")
                            {
                                el.setAttribute("action", "");
                                el.setAttribute("onsubmit", "return false;");
                            }
                        }
                        catch(error)
                        {
                            //what can be done about it?
                        }
                    }
                });

                require("fs").write(output, page.content, "w");

                phantom.exit();
            }, 1000);
        }
    });
}
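
The TODO about converting relative link paths to absolute ones is left open in the script. A minimal sketch of that conversion follows, assuming a runtime that provides the standard URL constructor (recent browsers and Node.js; PhantomJS's WebKit may not have it). The name toAbsoluteUrl is hypothetical and not part of the script.

```javascript
// Hypothetical helper, not part of save_as_html.js: resolves href against
// the address of the saved page and returns the absolute URL, or null when
// either value cannot be parsed.
// Assumption: the standard URL constructor is available in this runtime.
function toAbsoluteUrl(href, baseUrl) {
    try {
        return new URL(href, baseUrl).href;
    } catch (error) {
        return null;
    }
}

// Example: a parent-relative link on a page at /dir/page.html.
var absolute = toAbsoluteUrl("../img/a.png", "http://example.com/dir/page.html");
```

Inside page.evaluate() the same result may be available without any helper, because reading the href property of an <a> element (as opposed to its href attribute) yields the already-resolved absolute URL, so el.setAttribute("href", el.href) could replace the el.href="#" assignment.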
