Decoding numeric html entities via PHP
I have this code to decode numeric html entities to the UTF8 equivalent character.
I'm trying to convert this character reference:
&#146;
which should output:
’
However, it just disappears (no output). (I've checked the source code of the page; it has the correct UTF-8 character set headers/meta tags.)
Does anyone know what is wrong with the code?
function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8") {
    $string = html_entity_decode($string, $quote_style, $charset);
    $string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
    $string = preg_replace('~&#([0-9]+);~e', 'chr_utf8("\\1")', $string);
    //this is another method, which also doesn't work..
    //$string = preg_replace_callback("/(\&#[0-9]+;)/", "entity_decode_callback", $string);
    return $string;
}

function chr_utf8_callback($matches) {
    return chr_utf8(hexdec($matches[1]));
}

function chr_utf8($num) {
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}

function entity_decode_callback($m) {
    return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}

echo '=' . entity_decode('&#146;');
html_entity_decode already does what you're looking for:
$string = '&#146;';
echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');
It will return the character with binary hex c292,
which is PRIVATE USE TWO (U+0092), a C1 control character with no visible glyph. As it's a private-use control character, your PHP configuration/version/compile might not return it at all.
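A quick way to tell an invisible character apart from genuinely empty output is to dump the raw bytes. The snippet below is only a sketch: it assumes PHP 7 or later for the "\u{...}" escape, and the variable name $decoded is illustrative rather than taken from the question.

```php
<?php
// U+0092 written directly as a UTF-8 string (PHP 7+ escape syntax).
$decoded = "\u{0092}";

// The string is not empty: it is two bytes long, but has no visible glyph.
echo strlen($decoded), ' ', bin2hex($decoded), "\n"; // prints "2 c292"
```

If the same check on your own decoded string prints a length of 0, the character really was dropped; if it prints c292, it is there but a browser or terminal will not show it.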
Also there are some more quirks:
But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
See: &#146; is getting converted as "\u0092" by nokogiri in ruby on rails
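If the goal is the character a browser would display (the right single quotation mark, U+2019) rather than U+0092, one option is to apply the cp1252 remapping yourself. The following is a minimal sketch, not part of the original answer: decode_numeric_entities is a hypothetical helper name, and it assumes PHP 7.2 or later with the mbstring extension for mb_chr.

```php
<?php
// Hypothetical helper: decodes decimal and hex numeric character references,
// treating code points 128..159 as Windows-1252 bytes, as browsers do.
function decode_numeric_entities(string $html): string
{
    // cp1252 bytes 0x80..0x9F mapped to Unicode code points.
    // 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in cp1252 and are omitted.
    $cp1252 = [
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    ];

    $result = preg_replace_callback(
        '~&#(?:x([0-9a-f]+)|([0-9]+));~i',
        function (array $m) use ($cp1252): string {
            // Group 1 holds hex digits, group 2 holds decimal digits.
            $code = $m[1] !== '' ? (int) hexdec($m[1]) : (int) $m[2];
            // Apply the browser quirk for the 128..159 range.
            $code = $cp1252[$code] ?? $code;
            $char = mb_chr($code, 'UTF-8');
            // Leave the reference untouched if it is not a valid code point.
            return $char === false ? $m[0] : $char;
        },
        $html
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo bin2hex(decode_numeric_entities('&#146;')), "\n"; // prints "e28099" (U+2019)
```

Named entities such as &amp; are not touched by this sketch; passing the result through html_entity_decode afterwards would handle those.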