PHP DomDocument failing to handle utf-8 characters (☆)


Problem description

The webserver is serving responses with UTF-8 encoding, all files are saved as UTF-8, and every setting I know of has been set to UTF-8.

Here's a quick program to test whether the output works:

<?php
$html = <<<HTML
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Test!</title>
</head>
<body>
    <h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
HTML;

$dom = new DomDocument("1.0", "utf-8");
$dom->loadHTML($html);

header("Content-Type: text/html; charset=utf-8");
echo($dom->saveHTML());

The output of the program is:

<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Test!</title></head><body>
    <h1>&acirc;&#152;&#134; Hello &acirc;&#152;&#134; World &acirc;&#152;&#134;</h1>
</body></html>

Which renders each ☆ as a run of garbled characters instead of the star.

What could I be doing wrong? How much more specific do I have to be to tell the DomDocument to handle UTF-8 properly?

Recommended answer

DOMDocument::loadHTML() expects an HTML string.

Per its specs, HTML uses the ISO-8859-1 encoding (ISO Latin Alphabet No. 1) as the default. That has been the case for a long time, see 6.1. The HTML Document Character Set. In practice, common web browsers treat that default as Windows-1252.

I go back that far because PHP's DOMDocument is based on libxml, which brings the HTMLparser, and that parser is designed for HTML 4.0.

I'd say it's safe to assume, then, that you can load an ISO-8859-1 encoded string.
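To see why the output in the question contains &acirc;&#152;&#134;, it can help to look at what happens when the UTF-8 bytes of ☆ are read as ISO-8859-1. The following is a minimal sketch, not part of the original answer; the helper name misreadAsLatin1 is made up for illustration, and it assumes the mbstring extension and PHP 7.4 or later.

```php
<?php
// Hypothetical helper: returns the code points you get when a UTF-8 string
// is (wrongly) interpreted as ISO-8859-1, one code point per byte.
function misreadAsLatin1(string $utf8): array
{
    // Treat the raw bytes as ISO-8859-1 and re-encode the result as UTF-8.
    $misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

    return array_map(
        fn (string $ch): int => mb_ord($ch, 'UTF-8'),
        mb_str_split($misread, 1, 'UTF-8')
    );
}

echo bin2hex("☆"), "\n";                        // e29886 (three UTF-8 bytes)
echo implode(' ', misreadAsLatin1("☆")), "\n";  // 226 152 134
```

Code point 226 is â (&acirc;), and 152 and 134 are the two numeric entities that follow it in the question's output, so each star has been split into three single-byte characters.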

Your string is UTF-8 encoded. Turn all characters higher than 127 (0x7F) into HTML entities and you're fine. If you don't want to do that yourself, that is what mb_convert_encoding with the HTML-ENTITIES target encoding does:

  • Characters that have a named entity get the named entity, e.g. € -> &euro;
  • The others get their numeric (decimal) entity, e.g. ☆ -> &#9734;

The following code example makes the process a bit more visible by using a callback function:

$html = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function($match) {
    list($utf8) = $match;
    $entity = mb_convert_encoding($utf8, 'HTML-ENTITIES', 'UTF-8');
    printf("%s -> %s\n", $utf8, $entity);
    return $entity;
}, $html);

For your string, this example outputs:

☆ -> &#9734;
☆ -> &#9734;
☆ -> &#9734;

Anyway, that's just for looking deeper into your string. You want to have it converted into an encoding loadHTML can deal with. That can be done by converting everything outside of US-ASCII into HTML entities:

$us_ascii = mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');

Take care that your input is actually UTF-8 encoded. If you have mixed encodings (that can happen with some inputs), note that mb_convert_encoding can only handle one encoding per string. I already outlined above how to do more specific string replacements with the help of regular expressions, so I'll leave further details aside for now.
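As of PHP 8.2, the 'HTML-ENTITIES' encoding in mbstring is deprecated, so on newer versions the conversion above may need a different call. The following is a minimal sketch using mb_encode_numericentity() instead; the helper name utf8ToNumericEntities is made up for illustration and is not part of the original answer. Unlike HTML-ENTITIES, this variant emits only numeric entities (€ becomes &#8364; rather than &euro;). It also rejects input that is not valid UTF-8, in line with the caveat above.

```php
<?php
// Hypothetical helper: convert every non-ASCII character of a UTF-8 string
// into a decimal numeric entity so that loadHTML() only sees US-ASCII.
function utf8ToNumericEntities(string $html): string
{
    if (!mb_check_encoding($html, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($html, $map, 'UTF-8');
}

$ascii = utf8ToNumericEntities('<h1>☆ Hello ☆ World ☆</h1>');
echo $ascii, "\n"; // <h1>&#9734; Hello &#9734; World &#9734;</h1>

$dom = new DOMDocument();
$dom->loadHTML($ascii);

// DOM strings are UTF-8 internally, so the stars come back intact.
echo $dom->getElementsByTagName('h1')->item(0)->textContent, "\n";
```

Checking textContent rather than the saveHTML() output keeps the check independent of how the serializer chooses to write non-ASCII characters.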

The other alternative is to hint the encoding. In your case this can be done by modifying the document and adding a

<meta http-equiv="content-type" content="text/html; charset=utf-8">

which is a Content-Type specifying a charset. That is also best practice for HTML strings that are not served by a webserver (e.g. saved on disk or inside a string like in your example). The webserver normally sets that as the response header.

If you don't mind the warnings about the misplaced element, you can just add it in front of the string:

$dom = new DomDocument();
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);

Per the HTML 2.0 specs, elements that can only appear in the <head> section of a document will be automatically placed there. This is what happens here, too. The output (pretty-printed):

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <meta charset="utf-8">
    <title>Test!</title>
  </head>
  <body>
    <h1>☆ Hello ☆ World ☆</h1>    
  </body>
</html>
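The prepend approach can be wrapped in a small function that also keeps the misplaced-element warnings out of the output. This is only a sketch under assumptions: the helper name loadUtf8Html is made up for illustration, and collecting the warnings with libxml_use_internal_errors() is one possible way to handle them, not something the original answer prescribes.

```php
<?php
// Hypothetical helper: load a UTF-8 HTML string by prepending a charset hint.
function loadUtf8Html(string $html): DOMDocument
{
    $hint = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';

    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($hint . $html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$html = '<!doctype html><html><head><title>Test!</title></head>'
      . '<body><h1>☆ Hello ☆ World ☆</h1></body></html>';

$dom = loadUtf8Html($html);
echo $dom->getElementsByTagName('h1')->item(0)->textContent, "\n";
```

If the injected meta element is unwanted in the final markup, it could be removed from the DOM after loading; that step is left out here to keep the sketch short.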
