PHP 5.4.16 DOMDocument removes parts of Javascript


Question

I am trying to load an HTML page from a remote server into a PHP script, which should manipulate the HTML with the DOMDocument class. However, I have noticed that DOMDocument removes some parts of the Javascript that comes with the HTML page. The page contains things like:

<script type="text/javascript">
//...
function printJSPage() {
    var printwin=window.open('','haha','top=100,left=100,width=800,height=600');
    printwin.document.writeln(' <table border="0" cellspacing="5" cellpadding="0" width="100%">');
    printwin.document.writeln(' <tr>');
    printwin.document.writeln(' <td align="left" valign="bottom">');
    //...
    printwin.document.writeln('</td>');
    //...
}
</script>

But DOMDocument changes, for example, the line

printwin.document.writeln('</td>');

to

printwin.document.writeln(' ');

and many other things as well (for example, the last script tag is no longer there). As a result I get a completely destroyed page, which I cannot pass on.

So I think DOMDocument has problems with the HTML tags inside the Javascript code and tries to correct the code in order to produce a well-formed document. Can I prevent DOMDocument from parsing the Javascript?

The PHP code snippet is:

$stdin = file_get_contents('php://stdin');
$dom = new \DOMDocument();
@$dom->loadHTML($stdin);
return $dom->saveHTML();   // will produce wrong HTML
//return $stdin;           // will produce correct HTML

I have saved both HTML versions and compared them with Meld.

I also tried:

@$dom->loadXML($stdin);
return $dom->saveHTML();

but I got nothing back.

Recommended answer

Here's a hack that might be helpful. The idea is to replace the script contents with a string that is unique and guaranteed to be valid HTML, parse the document, and then swap the original contents back in.

It replaces all contents inside script tags with the MD5 of those contents and then replaces them back.

$scriptContainer = [];
// Swap each script body for its MD5 hash so the parser never sees the embedded markup.
$str = preg_replace_callback("#<script([^>]*)>(.*?)</script>#s", function ($matches) use (&$scriptContainer) {
    $scriptContainer[md5($matches[2])] = $matches[2];
    return "<script" . $matches[1] . ">" . md5($matches[2]) . "</script>";
}, $str);
$dom = new \DOMDocument();
@$dom->loadHTML($str);
// Restore the original script bodies in the serialized output.
$final = strtr($dom->saveHTML(), $scriptContainer);

Here strtr is simply convenient because of the way the array is laid out (hash => original contents); str_replace(array_keys($scriptContainer), $scriptContainer, $dom->saveHTML()) would also work.
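
For anyone who wants to try the hack end to end, here is a minimal, self-contained sketch. The function name roundTripKeepingScripts and the sample page are made up for illustration and are not part of the original snippet. It also assumes two small changes that you may or may not want: libxml_use_internal_errors() instead of the @ operator, and a guard for pages without any script tag, because I would not rely on strtr() returning the input unchanged for an empty array on older PHP versions.

```php
<?php
// Hypothetical helper (name invented for this sketch): run HTML through
// DOMDocument while keeping every <script> body exactly as it was.
function roundTripKeepingScripts($html)
{
    $scripts = [];

    // Replace each script body with its MD5 hash. The hash is plain hex,
    // so the HTML parser has nothing to "repair" inside the script element.
    $masked = preg_replace_callback(
        '#<script([^>]*)>(.*?)</script>#is',
        function (array $m) use (&$scripts) {
            $key = md5($m[2]);
            $scripts[$key] = $m[2];
            return '<script' . $m[1] . '>' . $key . '</script>';
        },
        $html
    );

    $dom = new DOMDocument();
    // Assumption: buffer parser warnings instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($masked);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // ... manipulate $dom here ...

    $out = $dom->saveHTML();

    // Guard for pages without scripts, then swap the hashes back.
    return $scripts ? strtr($out, $scripts) : $out;
}

// Sample input modelled on the page from the question.
$html = <<<'HTML'
<html><head><script type="text/javascript">
function printJSPage() {
    var printwin = window.open('', 'haha', 'width=800,height=600');
    printwin.document.writeln('</td>');
}
</script></head><body><p>Hello</p></body></html>
HTML;

$result = roundTripKeepingScripts($html);
echo $result;
```

The regular expression is deliberately simple. It will be confused by a literal </script> inside a Javascript string, so treat this as a workaround for pages you control or have inspected, not as a general HTML tokenizer.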

I find it very surprising that PHP does not properly parse HTML content. It seems to be parsing XML content instead (and wrongly at that, because CDATA content is parsed instead of being treated literally). However, it is what it is, and if you want a real document parser then you should probably look into a Node.js solution with jsdom.
