Specify UTF-8 encoding for PHP's DOMDocument without a meta tag


Question

I have the following HTML, which contains a Chinese string inside a script tag and some HTML code inside code tags.

<?php

$html = <<<EOD
<!DOCTYPE html>
<html>
    <head>
        <script>
            const str = "訂閱最新指南";
        </script>
    </head>
    <body>
        <pre>
            <code>&lt;img src="cat.jpg"/></code>
        </pre>
        <p>The code for new line is <code>&lt;br/></code> in HTML.</p>
    </body>
</html>
EOD;

I'm parsing this code with PHP's DOMDocument. After saveHTML(), the Chinese characters are somehow converted into garbled characters. The only solution I have found is to add <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> to the <head> tag.

Is there any other way to specify UTF-8 encoding without adding this meta tag?

Here is everything I've tried (none of it works):

// Default way. Chinese characters got encoded
$doc = new DOMDocument();
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Passed UTF-8 as parameter. Chinese characters got encoded
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Set encoding. Chinese characters got encoded
$doc = new DOMDocument();
$doc->encoding = 'UTF-8';
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Using mb_convert_encoding. Chinese characters got encoded
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$doc->loadHTML($html);
echo $doc->saveHTML() . PHP_EOL . PHP_EOL;

// Use html_entity_decode to decode. But this also decodes the entities inside the code tags
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$doc->loadHTML($html);
echo html_entity_decode($doc->saveHTML()) . PHP_EOL . PHP_EOL;

Recommended answer

If you want to unconditionally override the encoding to UTF-8, you can do it by prepending the UTF-8 BOM to the file:

$doc = new DOMDocument();
$doc->loadHTML(str_starts_with($html, "\xEF\xBB\xBF")
    ? $html : ("\xEF\xBB\xBF" . $html));

The conditional expression is necessary because the library emits warnings if a double BOM is present at the beginning.
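To make the BOM approach concrete, here is a minimal sketch that wraps it in a helper and checks the parsed text. The function name loadHtmlAsUtf8 and the sample markup are illustrative assumptions, not part of the original answer, and str_starts_with requires PHP 8.0 or later.

```php
<?php
// Hypothetical helper (name is an assumption): make sure the input starts
// with exactly one UTF-8 BOM so the parser treats the bytes as UTF-8.
function loadHtmlAsUtf8(string $html): DOMDocument
{
    $bom = "\xEF\xBB\xBF";
    $doc = new DOMDocument();
    $doc->loadHTML(str_starts_with($html, $bom) ? $html : $bom . $html);
    return $doc;
}

// Sample input with no <meta> charset declaration at all.
$sample = '<!DOCTYPE html><html><head>'
        . '<script>const str = "訂閱最新指南";</script>'
        . '</head><body><p>text</p></body></html>';

$doc = loadHtmlAsUtf8($sample);

// Read the script content back from the DOM tree (not from serialised markup).
$scriptText = $doc->getElementsByTagName('script')->item(0)->textContent;
```

Checking textContent rather than the serialised output isolates the loading step: if the string survives in the DOM, any remaining mangling comes from how the document is serialised.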

If you merely want to have UTF-8 as the default encoding instead of latin1, there is no clean way to do that. You can use the following dirty hack, though:

$doc = new DOMDocument();
$doc->loadHTML($html);
if ($doc->encoding === null) {
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    $node = $doc->firstChild;
    while (!($node instanceof DOMProcessingInstruction)) {
        $node = $node->nextSibling;
    }
    $node->parentNode->removeChild($node);
}

The above has the unfortunate side effect that when the encoding declaration is missing from the file, the parse time is effectively doubled. (Also note that the HTML specification does not prescribe looking at <?xml ?> processing instructions to detect the character encoding, meaning this workaround relies on functionality contrary to the specification.)

To make sure characters are not mangled during serialisation back to markup, use $doc->saveHTML($doc) instead of $doc->saveHTML(). This will always result in UTF-8 text, even if the document contains a declaration specifying a different encoding. To obtain the document in another encoding, you will have to convert it afterwards, for example by doing mb_convert_encoding($doc->saveHTML($doc), $doc->xmlEncoding, 'utf-8') (which should convert to the original encoding, although even this may still contradict a <meta> element found in the actual DOM tree).
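The serialisation advice can be sketched as follows. serializeAs is a hypothetical helper name, the mbstring extension is assumed to be available, and the input is loaded with the BOM approach described earlier in this answer. Treat it as an illustration under those assumptions, not a definitive implementation.

```php
<?php
// Hypothetical helper (name is an assumption): serialise via saveHTML($doc),
// which yields UTF-8, then convert to the requested target encoding.
function serializeAs(DOMDocument $doc, string $targetEncoding): string
{
    $utf8 = $doc->saveHTML($doc);
    return mb_convert_encoding($utf8, $targetEncoding, 'UTF-8');
}

$doc = new DOMDocument();
// A leading BOM makes the parser read the input as UTF-8.
$doc->loadHTML("\xEF\xBB\xBF"
    . '<!DOCTYPE html><html><head></head><body><p>訂閱最新指南</p></body></html>');

// Passing the document node keeps non-ASCII characters as UTF-8 text.
$utf8 = $doc->saveHTML($doc);

// Convert afterwards when another encoding is needed.
$utf16 = serializeAs($doc, 'UTF-16LE');
```

Keeping the conversion in a separate step means the DOM code only ever deals with UTF-8, and the target encoding becomes an explicit decision at the output boundary.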

Given the number of workarounds necessary to use DOMDocument with anything approaching reliability, I’d strongly suggest switching to another parser. Preferably, to another programming language too.
