Decoding numeric html entities via PHP


Problem description





I have this code to decode numeric HTML entities to the equivalent UTF-8 character.

I'm trying to convert this character:

&#146;

which should output:

’

However, it just disappears (no output). (I've checked the source code of the page; it has the correct UTF-8 character set headers/meta tags.)

Does anyone know what is wrong with the code?

function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8") {
    $string = html_entity_decode($string, $quote_style, $charset);

    $string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
    $string = preg_replace('~&#([0-9]+);~e', 'chr_utf8("\\1")', $string);

    // this is another method, which also doesn't work..
    //$string = preg_replace_callback("/(\&#[0-9]+;)/", "entity_decode_callback", $string);

    return $string;
}

function chr_utf8_callback($matches) {
    return chr_utf8(hexdec($matches[1]));
}

function chr_utf8($num) {
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}

function entity_decode_callback($m) {
    return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}

echo '=' . entity_decode('&#146;');

Solution

html_entity_decode already does what you're looking for:

$string = '&#146;';

echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');

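It will return the character U+0092, whose UTF-8 bytes in hex are c2 92.

That is PRIVATE USE TWO (U+0092), a C1 control character with no visible glyph, not the right single quotation mark (U+2019) that a browser displays for &#146;. As it's a private-use control character, your PHP configuration/version/build might not return it at all.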
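
Since U+0092 is invisible when echoed, dumping the bytes is a more reliable way to see what a decoder actually returned. A minimal sketch, assuming PHP 7 or later (for the "\u{...}" escape syntax); the variable names are only illustrative:

```php
<?php
// U+0092 (PRIVATE USE TWO) encoded as UTF-8 is the two bytes c2 92.
$pu2 = "\u{0092}";
echo bin2hex($pu2), "\n";   // c292

// U+2019 (RIGHT SINGLE QUOTATION MARK) is three bytes in UTF-8.
$quote = "\u{2019}";
echo bin2hex($quote), "\n"; // e28099

// Inspect what html_entity_decode returns on this PHP build.
// No expected output is stated here, since it can differ between versions.
$decoded = html_entity_decode('&#146;', ENT_COMPAT, 'UTF-8');
echo bin2hex($decoded), "\n";
```

If the last line prints something other than c292, the function did not decode the reference to U+0092 on your installation.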

Also there are some more quirks:

But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.

See: &#146; is getting converted as "\u0092" by nokogiri in ruby on rails
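
The browser behaviour described in the quote can be approximated in PHP by remapping references in the 128 to 159 range to their cp1252 characters first, and then handing everything else to html_entity_decode. A minimal sketch, assuming PHP 7 or later; the constant CP1252_C1_MAP and the function decode_entities_browser_style are hypothetical names introduced here, not part of PHP:

```php
<?php
// cp1252 bytes 128..159 mapped to the Unicode characters they stand for.
// Bytes 129, 141, 143, 144 and 157 are undefined in cp1252 and are omitted.
const CP1252_C1_MAP = [
    128 => "\u{20AC}", 130 => "\u{201A}", 131 => "\u{0192}", 132 => "\u{201E}",
    133 => "\u{2026}", 134 => "\u{2020}", 135 => "\u{2021}", 136 => "\u{02C6}",
    137 => "\u{2030}", 138 => "\u{0160}", 139 => "\u{2039}", 140 => "\u{0152}",
    142 => "\u{017D}", 145 => "\u{2018}", 146 => "\u{2019}", 147 => "\u{201C}",
    148 => "\u{201D}", 149 => "\u{2022}", 150 => "\u{2013}", 151 => "\u{2014}",
    152 => "\u{02DC}", 153 => "\u{2122}", 154 => "\u{0161}", 155 => "\u{203A}",
    156 => "\u{0153}", 158 => "\u{017E}", 159 => "\u{0178}",
];

// Hypothetical helper: decode entities the way a browser would render them.
function decode_entities_browser_style(string $html): string
{
    // Step 1: rewrite decimal (&#146;) and hex (&#x92;) references that fall
    // in the cp1252 range; every other reference is returned unchanged.
    $html = preg_replace_callback(
        '/&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));/i',
        static function (array $m): string {
            $code = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            return CP1252_C1_MAP[$code] ?? $m[0];
        },
        $html
    );

    // Step 2: let PHP decode the remaining named and numeric entities.
    return html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decode_entities_browser_style('It&#146;s &#8364;5 &amp; up'), "\n";
// It’s €5 & up
```

The lookup table is spelled out rather than derived through an encoding conversion so that the sketch does not depend on the iconv or mbstring extensions. References to the five bytes that cp1252 leaves undefined are passed through to html_entity_decode untouched.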
