Decoding numeric HTML entities via PHP


Question

I have this code to decode numeric HTML entities to their UTF-8 equivalent characters.

I'm trying to convert this character reference:

&#146;

Which should output:

’

However, it just disappears (no output). I've checked the source of the page, and it has the correct UTF-8 character set headers and meta tags.

Does anyone know what is wrong with the code?

function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8") {    
     $string = html_entity_decode($string, $quote_style, $charset);

     $string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
     $string = preg_replace('~&#([0-9]+);~e', 'chr_utf8("\1")', $string);

    //this is another method, which also doesn't work.. 
     //$string = preg_replace_callback("/(&#[0-9]+;)/", "entity_decode_callback", $string);

     return $string; 
}




function chr_utf8_callback($matches) { 
     return chr_utf8(hexdec($matches[1])); 
}

function chr_utf8($num) {   
     if ($num < 128) return chr($num);
     if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
     if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
     if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
     return '';
}

function entity_decode_callback($m) { 
     return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); 
} 

 echo '=' . entity_decode('&#146;');

Recommended answer

html_entity_decode already does what you need:

$string = '&#146;';

echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');

It will return the character:

’   binary hex: c292

This is PRIVATE USE TWO (U+0092), a C1 control character rather than a printable one. Because it is a private-use control code, your PHP configuration/version/build might not return it at all.
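To see where the bytes c292 come from, the two-byte UTF-8 encoding of code point 146 can be worked out by hand. The sketch below is only an illustration; `utf8_two_byte` is a hypothetical helper name, not part of PHP or of the question's code, and it assumes PHP 7 or later for the `"\u{...}"` string escape.

```php
<?php
// Hypothetical helper for illustration: UTF-8 encode a code point below 0x800.
function utf8_two_byte(int $cp): string
{
    if ($cp < 0 || $cp >= 0x800) {
        throw new InvalidArgumentException('Only code points below 0x800 are handled here.');
    }
    if ($cp < 0x80) {
        return chr($cp); // plain ASCII, one byte
    }
    // Two-byte form: 110xxxxx 10xxxxxx
    $lead  = 0xC0 | ($cp >> 6);   // 146 >> 6 = 2   => 0xC2
    $trail = 0x80 | ($cp & 0x3F); // 146 & 63 = 18  => 0x92
    return chr($lead) . chr($trail);
}

echo bin2hex(utf8_two_byte(146)), "\n"; // c292
echo bin2hex("\u{0092}"), "\n";         // c292, the same bytes via PHP's Unicode escape
```

So the "disappearing" output is consistent with a control character being emitted: the bytes are there, but nothing visible is rendered.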

There is also a quirk to be aware of:

But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
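If the goal is to get the right single quote that browsers show for &#146;, one option is to apply the cp1252 remapping yourself before any other decoding. The following is a minimal sketch under stated assumptions: `decode_cp1252_numeric_refs` is a hypothetical name, the lookup table is the standard Windows-1252 layout for bytes 0x80 to 0x9F, and PHP 7 or later is assumed for the `"\u{...}"` escapes. It is not what the original answer proposes, only one way to mimic the browser behaviour.

```php
<?php
// Hypothetical helper: rewrite numeric references 128..159 the way browsers do,
// i.e. as the Windows-1252 character for that byte value. Other references are
// left untouched so they can still be passed to html_entity_decode() afterwards.
function decode_cp1252_numeric_refs(string $html): string
{
    // cp1252 bytes 0x80-0x9F => UTF-8 strings.
    // 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in cp1252 and are omitted.
    static $map = [
        128 => "\u{20AC}", 130 => "\u{201A}", 131 => "\u{0192}", 132 => "\u{201E}",
        133 => "\u{2026}", 134 => "\u{2020}", 135 => "\u{2021}", 136 => "\u{02C6}",
        137 => "\u{2030}", 138 => "\u{0160}", 139 => "\u{2039}", 140 => "\u{0152}",
        142 => "\u{017D}", 145 => "\u{2018}", 146 => "\u{2019}", 147 => "\u{201C}",
        148 => "\u{201D}", 149 => "\u{2022}", 150 => "\u{2013}", 151 => "\u{2014}",
        152 => "\u{02DC}", 153 => "\u{2122}", 154 => "\u{0161}", 155 => "\u{203A}",
        156 => "\u{0153}", 158 => "\u{017E}", 159 => "\u{0178}",
    ];

    return preg_replace_callback(
        // Group 1: hex digits (&#x92;), group 2: decimal digits (&#146;).
        '~&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));~',
        function (array $m) use ($map): string {
            $code = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            // Replace only the mapped cp1252 range; keep everything else as written.
            return $map[$code] ?? $m[0];
        },
        $html
    );
}

echo bin2hex(decode_cp1252_numeric_refs('&#146;')), "\n"; // e28099 (U+2019)
echo bin2hex(decode_cp1252_numeric_refs('&#x92;')), "\n"; // e28099 (U+2019)
echo decode_cp1252_numeric_refs('&#233;'), "\n";          // &#233; (outside the range, unchanged)
```

The explicit table avoids depending on the mbstring or iconv extensions, at the cost of hard-coding 27 entries.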

See: "&#146; is getting converted as \u0092 by nokogiri in ruby on rails"
