How to keep Chinese or other foreign-language text as it is instead of converting it into codes?

Question

DOMDocument seems to convert Chinese characters into codes. For instance,

你的乱发 becomes ä½ çš„ä¹±å‘

How can I keep Chinese or other foreign-language text as it is instead of converting it into codes?

Here is my simple test:

$dom = new DOMDocument();
$dom->loadHTML($html);

If I add the following line before loadHTML(),

$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8"); 

I get

&#20320;&#30340;&#20081;&#21457;

Even though the converted codes are displayed as Chinese characters, &#20320;&#30340;&#20081;&#21457; is still not the 你的乱发 I am after.
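
For illustration, the entity conversion described in the question can be reproduced in isolation. This is only a sketch, assuming the mbstring extension is available and the input is UTF-8. It uses `mb_encode_numericentity()` instead of the `HTML-ENTITIES` pseudo-encoding, which is deprecated as of PHP 8.2; for this particular string both yield decimal numeric entities. The variable names are made up for the example.

```php
<?php
$text = '你的乱发';

// Encode every code point from U+0080 upward as a decimal numeric entity.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$entities = mb_encode_numericentity($text, $convmap, 'UTF-8');
echo $entities, "\n"; // &#20320;&#30340;&#20081;&#21457;

// A browser decodes the entities for display; html_entity_decode() does the same in PHP.
$decoded = html_entity_decode($entities, ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($decoded === $text); // bool(true)
```

The entities render as the expected characters, but the stored string is still the escaped form, which is the behavior the question is trying to avoid.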

Recommended answer

DOMDocument seems to convert Chinese characters into codes [...]. How can I keep Chinese or other foreign-language text as it is instead of converting it into codes?

$dom = new DOMDocument();
$dom->loadHTML($html);

You're using the loadHTML function to load an HTML chunk. By default, DOMDocument expects that string to be in HTML's default encoding (ISO-8859-1). However, most often the charset (sic!) is meta-information provided next to the string you're using and not inside it. To make this more complicated, that meta-information may even be inside the string.
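
To see why the default matters, the garbling can be reproduced without DOMDocument at all. The sketch below is only an illustration, assuming the mbstring extension is available and the source string is UTF-8; the variable names are made up for the example. It reinterprets the UTF-8 bytes as ISO-8859-1, which is roughly what happens when the parser is given no charset.

```php
<?php
// Illustration only: $original, $misread and $restored are hypothetical names.
$original = '你的乱发'; // 4 characters, 12 bytes in UTF-8

// Treat every byte as one ISO-8859-1 character and re-encode the result as UTF-8.
$misread = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo mb_strlen($original, 'UTF-8'), "\n"; // 4
echo mb_strlen($misread, 'UTF-8'), "\n";  // 12: one character per original byte

// The damage is reversible because no bytes were lost.
$restored = mb_convert_encoding($misread, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $original); // bool(true)
```

Each Chinese character occupies three bytes in UTF-8, so a single-byte reading turns it into three unrelated characters.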

Anyway, as you have not shared the string data of the HTML and have not specified its encoding, it's hard to tell specifically what is going on.

I assume the HTML is UTF-8 encoded but this is not signalled within the HTML string. So the following workaround can help:

$doc = new DOMDocument();
// Prepend an encoding hint so the parser reads the string as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}
$doc->encoding = 'UTF-8'; // insert proper

It injects an encoding hint at the very beginning (and removes it after the HTML has been loaded). From that point on, DOMDocument will return UTF-8 (as always).
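
The workaround can be wrapped in a small helper and checked by reading the text back. This is a minimal sketch, not a definitive implementation: `loadUtf8Html()` is a hypothetical name, the input is assumed to be UTF-8, and the node type produced for the injected hint may differ between libxml2 versions, so it is worth verifying on your own setup.

```php
<?php
// Hypothetical helper: load a UTF-8 HTML fragment into a DOMDocument.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // The prepended hint tells the parser that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Remove the injected hint again. Iterate over a copy, because
    // childNodes is a live list that changes while nodes are removed.
    foreach (iterator_to_array($doc->childNodes) as $item) {
        if ($item->nodeType === XML_PI_NODE) {
            $doc->removeChild($item);
        }
    }
    $doc->encoding = 'UTF-8';

    return $doc;
}

$doc = loadUtf8Html('<p>你的乱发</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === '你的乱发');
```

Reading `textContent` is a convenient check because DOMDocument hands strings back as UTF-8, so the comparison is against the original literal rather than against entities.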
