UTF-8 with PHP DOMDocument loadHTML?

Question

Consider the following example, test.php:

<?php
$mystr = "<p>Hello, με काचं  ça øy jeść</p>";
var_dump($mystr);
$domdoc = new DOMDocument('1.0', 'utf-8'); //DOMDocument();
$domdoc->loadHTML($mystr); // already here corrupt UTF-8?
var_dump($domdoc);
?>

If I run this with PHP 5.5.9 (cli), I get in terminal:

$ php test.php 
string(50) "<p>Hello, με काचं  ça øy jeść</p>"
object(DOMDocument)#1 (34) {
  ["doctype"]=>
  string(22) "(object value omitted)"
...
  ["actualEncoding"]=>
  NULL
  ["encoding"]=>
  NULL
  ["xmlEncoding"]=>
  NULL
...
  ["textContent"]=>
  string(70) "Hello, με à¤à¤¾à¤à¤  ça øy jeÅÄ"
}

Clearly, the original string is correct as UTF-8, but the textContent of the DOMDocument is incorrectly encoded.

So, how can I get the content as correct UTF-8 in the DOMDocument?

Recommended answer

The DOM extension is built on libxml2, whose HTML parser was written for HTML 4, where the default encoding is ISO-8859-1. Unless it encounters an appropriate meta tag or XML declaration stating otherwise, loadHTML() will assume the content is ISO-8859-1.

Specifying the encoding when creating the DOMDocument, as you have, does not influence what the parser does: loading HTML (or XML) replaces both the XML version and the encoding that you gave its constructor.
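To see the replacement happen, the following minimal sketch reads the document's encoding property before and after loading. The variable names are illustrative only. The question's own dump (PHP 5.5.9) shows the property as NULL after loading; other PHP or libxml2 versions may report a different value, so no expected output is given here.

```php
<?php
// Sketch: compare the encoding property before and after loadHTML().
$doc = new DOMDocument('1.0', 'utf-8');
$before = $doc->encoding;   // the value that was passed to the constructor

// Plain ASCII markup with no meta tag or XML declaration.
$doc->loadHTML('<p>Hello</p>');
$after = $doc->encoding;    // now describes the freshly parsed document

var_dump($before, $after);
```

The constructor argument mainly matters for documents you build node by node and then serialize; it is not a hint to the parser.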

One option is to use mb_convert_encoding() to translate everything above the ASCII range into its HTML entity equivalent before loading.

$domdoc->loadHTML(mb_convert_encoding($mystr, 'HTML-ENTITIES', 'UTF-8'));
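The 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2, so on newer versions the same idea can be sketched with mb_encode_numericentity() instead. This is a substitute technique, not what the answer itself used. It assumes the mbstring extension is loaded; the conversion map and variable names are illustrative.

```php
<?php
$mystr = "<p>Hello, με काचं  ça øy jeść</p>";

// Map every code point from U+0080 to U+10FFFF to a numeric entity.
// ASCII, including the markup itself, is left untouched.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($mystr, $convmap, 'UTF-8');

$domdoc = new DOMDocument();
$domdoc->loadHTML($encoded);

// Read the paragraph back; the parser decodes the entities into UTF-8 text.
$text = $domdoc->getElementsByTagName('p')->item(0)->textContent;
var_dump($text);
```

Because the string handed to the parser is pure ASCII, the result no longer depends on which encoding the parser assumes.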

Alternatively, hack in a meta tag or XML declaration specifying UTF-8.

$domdoc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $mystr);

$domdoc->loadHTML('<?xml encoding="UTF-8">' . $mystr);
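The meta tag variant can be wrapped in a small helper, sketched below. loadUtf8Html() is a hypothetical name, not part of the DOM extension, and the sketch assumes PHP 7 or later for the type declarations. It also silences parser warnings with libxml_use_internal_errors(), which is optional.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML fragment by prepending a
// charset declaration for the libxml2 HTML parser to pick up.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings quietly
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the caller's setting
    return $doc;
}

$doc = loadUtf8Html("<p>Hello, με काचं  ça øy jeść</p>");
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
var_dump($text);
```

Note that the prepended meta element stays in the parsed tree (inside the head the parser creates), so remove it before serializing if it should not appear in the output.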
