PHP DOMDocument: Errors while parsing unescaped strings

Problem description

I'm having an issue while parsing HTML with PHP's DOMDocument.

The HTML I'm parsing has the following script tag:

<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="$onClick"><div class="sBtn">$text<img src="$iconImg" /></div><div class="sCountBox">$count</div></a></div>',
    }
</script>

This snippet has two problems:

1) The HTML inside the buttonWithCountTemplate variable is not escaped. DOMDocument handles this correctly, escaping the characters when parsing it. Not a problem.

2) Near the end, there's an img tag with an unescaped closing tag:

<img src="$iconImg" />

The /> makes DOMDocument think that the script has ended, even though the closing tag has not been reached yet. If you extract the script using getElementsByTagName, the element ends at this img tag, and the rest appears as text in the HTML.

My goal is to remove all scripts from this page. If I call removeChild() on this tag, the tag is removed, but the following part appears as text when the page is rendered:

</div><div class="sCountBox">$count</div></a></div>',
        }
    </script>

Fixing the HTML is not a solution, because I'm developing a generic parser that needs to handle all types of HTML.

My question is whether I should sanitize the HTML before feeding it to DOMDocument, whether there's an option I can enable on DOMDocument to avoid triggering this issue, or even whether I can strip all script tags before loading the HTML.

Any ideas?

After some research, I found the real problem with the DOMDocument parser. Consider the following HTML:

<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>

I used the following PHP code to remove script tags (based on Gholizadeh's answer):

<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('js.html'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
//@$dom->loadHTMLFile('script.html'); //fix tags if not exist

while($nodes = $dom->getElementsByTagName("script")) {
    if($nodes->length == 0) break;
    $script = $nodes->item(0);
    $script->parentNode->removeChild($script);
}

//return $dom->saveHTML();
$final = $dom->saveHTML();
echo $final;

The result is:

<div> <!-- Offending div without closing tag -->
<p>';
       // I should not appear on the result
</p></div>

The problem is that the first div tag is not closed, and it seems that DOMDocument treats the div tags inside the JS string as HTML instead of as a plain JS string.

What can I do to solve this? Remember that modifying the HTML is not an option, since I'm developing a generic parser.

Recommended answer

I tested the following code on an HTML file like this:

<p>some text 1</p>
<img src="http://www.example.com/images/some_image_1.jpg">
<p>some text 2</p>
<p>some text 3</p>
<img src="http://www.example.com/images/some_image_2.jpg">

<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="$onClick"><div class="sBtn">$text<img src="$iconImg" /></div><div class="sCountBox">$count</div></a></div>',
    }
</script>

<p>some text 4</p>
<p>some text 5</p>
<img src="http://www.example.com/images/some_image_3.jpg">

The PHP code is:

<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTML(file_get_contents('script.html'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
//@$dom->loadHTMLFile('script.html'); //fix tags if not exist

// getElementsByTagName() returns a live list, so copy it to an array
// before removing nodes; otherwise some script elements can be skipped.
$nodes = iterator_to_array($dom->getElementsByTagName("script"));

foreach ($nodes as $script) {
    $script->parentNode->removeChild($script);
}

//return $dom->saveHTML();
$dom->saveHTMLFile('script.html');

It works on the given example. I think you should use the options I used when loading the HTML.

Update based on the recent question update:

Actually, you can't parse [X]HTML with regex (read this link for more information), but if your only goal is to remove script tags, and you can make sure that no </script> tag appears as a string inside them, you can use this regex:

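// Note: the 'HTML-ENTITIES' target encoding is deprecated as of PHP 8.2.
$html = mb_convert_encoding(file_get_contents('script2.html'), 'HTML-ENTITIES', 'UTF-8');
// Non-greedy match from each opening <script ...> tag to the first </script>.
$new_html = preg_replace('/<script\b[^>]*>.*?<\/script\s*>/si', '', $html);
file_put_contents('script-result.html', $new_html);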
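
To connect the regex to the original goal, here is a minimal sketch that strips script elements from the raw string first and only then hands the result to DOMDocument, so the parser never sees the JS string. The helper name stripScriptsThenParse and the inline sample are illustrative assumptions, not part of the original answer, and the same caveat applies: a literal </script> inside a JS string would end the match early.

```php
<?php
// Hypothetical helper (not from the original answer): remove <script> elements
// from the raw markup before DOMDocument parses it.
function stripScriptsThenParse(string $html): string
{
    // Non-greedy match from each opening <script ...> tag to the first </script>.
    // preg_replace() returns null on failure, so fall back to the input.
    $clean = preg_replace('/<script\b[^>]*>.*?<\/script\s*>/si', '', $html) ?? $html;

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $dom->loadHTML($clean, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    return (string) $dom->saveHTML();
}

// Sample from the question: an unclosed <div> followed by a script containing '</div>'.
$sample = <<<'HTML'
<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>
HTML;

echo stripScriptsThenParse($sample);
```

Because the script body is gone before parsing, the '</div>' inside the JS string can no longer be mistaken for markup.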

Frankly, the problem is that your HTML may not be standard. I think it's better to try the other libraries linked here.

Otherwise, I guess you should write a special parser that removes script tags and takes care of the single and double quotes inside them.
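
One way such a special parser could look is sketched below. It is only an illustration under stated assumptions: the function name removeScriptBlocks is hypothetical, the scanner tracks only single- and double-quoted JS strings, and it assumes that JS comments, regex literals, and template literals in the scripts contain no quote characters. It also expects the closing tag to be written exactly as </script> (in any letter case).

```php
<?php
// Hypothetical helper: drop <script>...</script> blocks while ignoring any
// "</script>" that sits inside a single- or double-quoted JS string.
// Assumption: quotes never appear in JS comments, regex or template literals.
function removeScriptBlocks(string $html): string
{
    $out = '';
    $pos = 0;
    $len = strlen($html);

    // Find each opening <script ...> tag in turn, starting at $pos.
    while (preg_match('/<script\b[^>]*>/i', $html, $m, PREG_OFFSET_CAPTURE, $pos)) {
        $start = $m[0][1];
        $out  .= substr($html, $pos, $start - $pos);  // keep markup before the script
        $i     = $start + strlen($m[0][0]);           // first byte of the script body
        $quote = null;                                // open JS string delimiter, if any

        while ($i < $len) {
            $ch = $html[$i];
            if ($quote !== null) {
                if ($ch === '\\') { $i += 2; continue; }  // skip an escaped character
                if ($ch === $quote) { $quote = null; }    // string closed
            } elseif ($ch === '"' || $ch === "'") {
                $quote = $ch;                             // string opened
            } elseif (strncasecmp(substr($html, $i, 9), '</script>', 9) === 0) {
                $i += 9;                                  // consume the closing tag
                break;
            }
            $i++;
        }
        $pos = min($i, $len);  // an unterminated script swallows the rest of the input
    }

    return $out . substr($html, $pos);
}

// The closing tag inside the JS string is skipped; the real one ends the block.
echo removeScriptBlocks('a<script>var s = "</script>";</script>b');
```

The cleaned string can then be passed to loadHTML() as in the code above. Note that browsers end a script element at the first </script> regardless of quotes, so this quote-aware behavior is a deliberate design choice for this use case rather than standard HTML parsing.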
