PHP DOMDocument failing to handle UTF-8 characters (☆)

Problem description

The web server serves responses as UTF-8, all files are saved as UTF-8, and every setting I know of has been set to UTF-8.

Here's a quick program to test whether the output works:

<?php
$html = <<<HTML
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Test!</title>
</head>
<body>
    <h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
HTML;

$dom = new DOMDocument("1.0", "utf-8");
$dom->loadHTML($html);

header("Content-Type: text/html; charset=utf-8");
echo($dom->saveHTML());

The output of the program is:

<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Test!</title></head><body>
    <h1>&acirc;&#152;&#134; Hello &acirc;&#152;&#134; World &acirc;&#152;&#134;</h1>
</body></html>

This renders as garbled characters where each ☆ should be.

What could I be doing wrong? How much more specific do I have to be to get DOMDocument to handle UTF-8 properly?

Recommended answer

DOMDocument::loadHTML() expects an HTML string.

Per its specs, HTML uses the ISO-8859-1 encoding (ISO Latin Alphabet No. 1) as its default. That has been the case for a long time; see 6.1. The HTML Document Character Set. In practice, common web browsers treat that default as Windows-1252.

I go back that far because PHP's DOMDocument is based on libxml, which brings in the HTMLparser module, and that parser is designed for HTML 4.0.

So I'd say it's safe to assume that you can load an ISO-8859-1 encoded string.
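That assumption also explains the garbled output in the question. As a minimal sketch (not part of the original answer), the following prints the raw UTF-8 bytes of ☆. When each byte is read as a separate ISO-8859-1 character, the three bytes become the three entities seen in the output (byte 226 is â, i.e. &acirc;):

```php
<?php
// "☆" is U+2606, which UTF-8 encodes as three bytes.
$star = "\u{2606}";

// Hex dump of the UTF-8 byte sequence.
echo bin2hex($star), "\n"; // e29886

// Treat every byte as its own character, the way an ISO-8859-1
// reader would, and print it as a decimal numeric entity.
foreach (str_split($star) as $byte) {
    printf("&#%d; ", ord($byte));
}
echo "\n"; // &#226; &#152; &#134;
```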

Your string is UTF-8 encoded. Turn all characters above 127 (\x7F) into HTML entities and you're fine. If you don't want to do that yourself, that is what mb_convert_encoding with the HTML-ENTITIES target encoding does:

  • Characters that have a named entity get the named entity: € -> &euro;
  • All others get their numeric (decimal) entity, e.g. ☆ -> &#9734;

The following code example makes the process a bit more visible by using a callback function:

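// Match every character outside of US-ASCII (the /u flag enables UTF-8 mode).
$html = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function($match) {
    list($utf8) = $match;
    // Convert the single matched character into its HTML entity.
    $entity = mb_convert_encoding($utf8, 'HTML-ENTITIES', 'UTF-8');
    // Show each replacement as it happens.
    printf("%s -> %s\n", $utf8, $entity);
    return $entity;
}, $html);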

Example output for your string:

☆ -> &#9734;
☆ -> &#9734;
☆ -> &#9734;

Anyway, that's just for looking deeper into your string. You want to have it converted into an encoding that loadHTML can deal with. That can be done by converting everything outside of US-ASCII into HTML entities:

$us_ascii = mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');
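As of PHP 8.2 the HTML-ENTITIES mbstring encoding is deprecated, so the following is a minimal sketch of the same idea without it. The helper name utf8_to_numeric_entities is hypothetical (it is not from the original answer), and unlike mb_convert_encoding it emits only decimal numeric entities, never named ones:

```php
<?php
// Hypothetical helper: replaces every non-ASCII character of a UTF-8
// string with its decimal numeric entity (e.g. "☆" becomes "&#9734;").
function utf8_to_numeric_entities(string $utf8): string
{
    $result = preg_replace_callback(
        '/[\x{80}-\x{10FFFF}]/u',
        function (array $match): string {
            // mb_ord() returns the Unicode code point of the character.
            return '&#' . mb_ord($match[0], 'UTF-8') . ';';
        },
        $utf8
    );

    // With the /u flag, preg_replace_callback() returns null when the
    // subject is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    return $result;
}

$html = '<!doctype html><html><head><title>Test!</title></head>'
      . '<body><h1>☆ Hello ☆ World ☆</h1></body></html>';

// The converted string is pure US-ASCII, so the parser's default
// encoding no longer matters.
$us_ascii = utf8_to_numeric_entities($html);

$dom = new DOMDocument();
$dom->loadHTML($us_ascii);

// DOM text is always UTF-8 again on the PHP side.
$text = $dom->getElementsByTagName('h1')->item(0)->textContent;
echo $text, "\n";
```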

Take care that your input is actually UTF-8 encoded. If you have mixed encodings (that can happen with some inputs), note that mb_convert_encoding can only handle one encoding per string. I already outlined above how to do string replacements more specifically with the help of regular expressions, so I'll leave out further details for now.
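One way to verify that assumption before converting is mb_check_encoding. This is a small sketch with made-up sample strings, not something the original answer prescribes:

```php
<?php
// A well-formed UTF-8 string.
$valid = "\u{2606} Hello";

// The same star with its last byte cut off: not valid UTF-8.
$invalid = "\xE2\x98 Hello";

var_dump(mb_check_encoding($valid, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8')); // bool(false)
```

Rejecting or repairing input that fails this check before it reaches the entity conversion keeps a single bad byte sequence from corrupting the document.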

The other alternative is to hint the encoding. In your case this can be done by modifying the document and adding a

<meta http-equiv="content-type" content="text/html; charset=utf-8">

which is a Content-Type specifying a charset. That is also best practice for HTML strings that are not served by a web server (e.g. saved on disk or held in a string, as in your example). The web server normally sets that as a response header.

If you don't mind the warnings about the misplaced element, you can just add it in front of the string:

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);

Per the HTML 2.0 specs, elements that can only appear in the <head> section of a document will automatically be placed there. This is what happens here, too. The output (pretty-printed):

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <meta charset="utf-8">
    <title>Test!</title>
  </head>
  <body>
    <h1>☆ Hello ☆ World ☆</h1>    
  </body>
</html>
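Putting the hint approach together, a minimal sketch might look like the following. Collecting the parser warnings with libxml_use_internal_errors is an addition of this sketch, not part of the original answer:

```php
<?php
$html = '<!doctype html><html><head><meta charset="utf-8">'
      . '<title>Test!</title></head>'
      . '<body><h1>☆ Hello ☆ World ☆</h1></body></html>';

// Collect parser warnings internally instead of emitting them; the
// prepended <meta> element is technically misplaced.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML(
    '<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $html
);

// Discard the collected warnings once parsing is done.
libxml_clear_errors();

// With the encoding hint in place, the text is decoded as UTF-8.
$text = $dom->getElementsByTagName('h1')->item(0)->textContent;
echo $text, "\n";
```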
