html_entity_decode - character encoding issue

Question

I am having issues with character encoding. I have reduced the problem to the script below:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<?php

$string = 'Stan&#146;s';

echo $string.'<br><br>'; // Stan's

echo html_entity_decode($string).'<br><br>'; // Stan's

echo html_entity_decode($string, ENT_QUOTES, 'UTF-8'); // Stans

?>
</body>
</html>

I would like to use the last echo, but it removes the apostrophe. Why?

I have tried all three options (ENT_COMPAT, ENT_QUOTES, ENT_NOQUOTES), and the apostrophe is removed in every case.

Recommended answer

The problem is that &#146; decodes to the Unicode character U+0092, UTF-8 C2 92, known as PRIVATE USE TWO:

$ php test.php | xxd
0000000: 5374 616e c292 73                        Stan..s

I.e., this doesn't decode to a usual apostrophe.
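
To make the difference concrete, the following sketch compares the bytes involved. It assumes PHP 7 or later (for the "\u{...}" escape syntax) and the iconv extension; the variable names are illustrative only.

```php
<?php
// U+0092 is a C1 control character; U+2019 is the typographic apostrophe
// (RIGHT SINGLE QUOTATION MARK).
$control = "\u{0092}";
$quote   = "\u{2019}";

echo bin2hex($control), "\n"; // c292
echo bin2hex($quote), "\n";   // e28099

// The single byte 0x92, read as Windows-1252, converts to U+2019.
$fromCp1252 = iconv('CP1252', 'UTF-8', "\x92");
echo bin2hex($fromCp1252), "\n"; // e28099
```

The first line of output matches the C2 92 sequence in the xxd dump above, while the apostrophe the author of the HTML presumably intended is the three-byte sequence E2 80 99.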

html_entity_decode($string) appears to work only because it doesn't actually decode the entity: the default target charset (latin-1 in PHP versions before 5.4) cannot represent this character, so the reference is passed through unchanged. If you specify UTF-8 as the target charset, the entity is actually decoded.

The target of that entity is the Windows-1252 charset:

echo iconv('cp1252', 'UTF-8', html_entity_decode('Stan&#146;s', ENT_QUOTES, 'cp1252'));

Stan’s

Quoting Wikipedia:

Numeric references always refer to Unicode code points, regardless of the page's encoding. Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so &#153;, for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.

So you're dealing with legacy HTML entities here, which PHP apparently doesn't handle the way "some" browsers do. You may want to check whether a numeric reference falls in the range specified above and, if so, decode it as Windows-1252 and then convert the result to UTF-8. Or require your users to pass valid HTML.

This function should handle both legacy and regular HTML entities:

function legacy_html_entity_decode($str, $quotes = ENT_QUOTES, $charset = 'UTF-8') {
    return preg_replace_callback('/&#(\d+);/', function ($m) use ($quotes, $charset) {
        if (0x80 <= $m[1] && $m[1] <= 0x9F) {
            return iconv('cp1252', $charset, html_entity_decode($m[0], $quotes, 'cp1252'));
        }
        return html_entity_decode($m[0], $quotes, $charset);
    }, $str);
}
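
As a variation on the function above, here is a minimal sketch that also covers hexadecimal references such as &#x92; and named entities. The function name decode_entities_with_cp1252_fallback is hypothetical, and the sketch assumes UTF-8 output, PHP 7 or later, and the iconv extension. It converts the byte directly with chr() rather than through html_entity_decode(), so it does not depend on how a given PHP version treats references in the 80–9F range.

```php
<?php
/**
 * Hypothetical helper (not part of the original answer): decodes HTML
 * entities to UTF-8, treating numeric references in the 0x80-0x9F range
 * as Windows-1252 bytes, the way some browsers do.
 */
function decode_entities_with_cp1252_fallback(string $str, int $flags = ENT_QUOTES): string
{
    // Bytes that Windows-1252 leaves undefined; references to them are not remapped.
    $undefined = [0x81, 0x8D, 0x8F, 0x90, 0x9D];

    // Pass 1: rewrite only legacy references (decimal or hexadecimal).
    $rewritten = preg_replace_callback(
        '/&#(?:x([0-9a-f]{1,6})|(\d{1,7}));/i',
        function (array $m) use ($undefined): string {
            // Group 1 holds hex digits, group 2 holds decimal digits.
            $code = ($m[1] !== '') ? (int) hexdec($m[1]) : (int) $m[2];

            if ($code >= 0x80 && $code <= 0x9F && !in_array($code, $undefined, true)) {
                $converted = iconv('CP1252', 'UTF-8', chr($code));
                return $converted === false ? $m[0] : $converted;
            }

            // Anything else is left for html_entity_decode() in pass 2.
            return $m[0];
        },
        $str
    );

    // Pass 2: decode the remaining named and numeric entities once.
    return html_entity_decode($rewritten ?? $str, $flags, 'UTF-8');
}

echo decode_entities_with_cp1252_fallback('Stan&#146;s'), "\n";  // Stan’s
echo decode_entities_with_cp1252_fallback('Stan&#x92;s'), "\n";  // Stan’s
echo decode_entities_with_cp1252_fallback('a &amp; b'), "\n";    // a & b
```

Doing the legacy rewrite first and a single html_entity_decode() pass afterwards keeps input such as &amp;#146; from being decoded twice.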
