How can I convert HTML character references (&#x5E3;) to regular UTF-8?


Problem description


I have some Hebrew websites that contain character references like: &#x5E0;&#x5D5;&#x5E3;

I can only view these letters if I save the file as .html and view in UTF-8 encoding.

If I try to open it as a regular text file then UTF-8 encoding does not show the proper output.

I noticed that if I open a text editor and write Hebrew in UTF-8, each character takes two bytes, not 4 bytes like in this example (&#x5D5;)

Any ideas if this is UTF-16 or any other kind of UTF representation of letters?

How can I convert it to normal letters if possible?

Using latest PHP version.

Solution

Those are character references. Each one refers to a character in ISO 10646 by specifying the code point of that character in decimal (&#n;) or hexadecimal (&#xn;) notation.
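
To make the relationship between a reference, its code point, and the UTF-8 bytes concrete, here is a small sketch. It assumes PHP 7.2+ with the mbstring extension (for mb_chr); the variable names are only illustrative.

```php
<?php
// Both notations name the same code point, U+05D5 (HEBREW LETTER VAV).
$fromHex = intval('5D5', 16);   // the digits of &#x5D5;
$fromDec = intval('1493', 10);  // the digits of &#1493;

// mb_chr() turns a code point into a character in the given encoding.
$vav = mb_chr($fromHex, 'UTF-8');

var_dump($fromHex === $fromDec);   // bool(true)
echo strlen($vav), "\n";           // 2 (bytes in UTF-8)
echo bin2hex($vav), "\n";          // d795
echo strlen('&#x5D5;'), "\n";      // 7 (ASCII bytes of the reference itself)
```

So the reference is not UTF-16 or another UTF form; it is plain ASCII text that names the code point, and the two-byte UTF-8 form only appears once it is decoded.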

You can use html_entity_decode, which decodes such character references as well as the entity references for entities defined for HTML 4, so other references like &lt;, &gt;, &amp; will also get decoded:

$str = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');
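
As a quick check of what that call does, here is a minimal sketch with a made-up input string. Because ENT_NOQUOTES is passed, the &quot; reference should be left alone while everything else is decoded.

```php
<?php
// Sample input (an assumption for illustration): Hebrew numeric
// references, two named entities, and a quote entity.
$html = '&#x5E0;&#x5D5;&#x5E3; &lt;b&gt; &quot;';

$text = html_entity_decode($html, ENT_NOQUOTES, 'UTF-8');

echo $text, "\n";                        // Hebrew letters, then <b> &quot;
echo bin2hex(substr($text, 0, 6)), "\n"; // d7a0d795d7a3
```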

If you just want to decode the numeric character references, you can use this:

function html_dereference($match) {
    if (strtolower($match[1][0]) === 'x') {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
}
$str = preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', 'html_dereference', $str);
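
For newer PHP versions, the same idea can be sketched with mb_chr instead of pack plus mb_convert_encoding. This is a variant, not the code above: it assumes PHP 7.2+ with mbstring, and the function name decode_numeric_refs is hypothetical. References that do not name a valid code point are left untouched.

```php
<?php
// Hypothetical helper: decodes only numeric character references,
// leaving named entities such as &amp; as they are.
function decode_numeric_refs(string $str): string
{
    return preg_replace_callback(
        '/&#(x[0-9a-f]+|[0-9]+);/i',
        function (array $m): string {
            $ref = $m[1];
            // A leading "x" or "X" marks hexadecimal notation.
            $codepoint = (strtolower($ref[0]) === 'x')
                ? intval(substr($ref, 1), 16)
                : intval($ref, 10);
            // mb_chr() returns false for invalid code points;
            // in that case keep the original reference text.
            $char = mb_chr($codepoint, 'UTF-8');
            return $char === false ? $m[0] : $char;
        },
        $str
    );
}

echo decode_numeric_refs('&#x5E0;&#1493;&#x5E3; &amp;'), "\n";
```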


As YuriKolovsky and thirtydot have pointed out in another question, it seems that browser vendors have ‘silently’ agreed on a mapping for some character references that differs from the specification and is largely undocumented.

There seem to be some character references that would normally be mapped onto the Latin 1 supplement but that are actually mapped onto different characters. The mapping browsers use is the one that results from treating the numbers as Windows-1252 bytes instead of ISO 8859-1, on which the Unicode character set is built. Jukka Korpela wrote an extensive article on this topic.
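
The difference between the two interpretations can be seen with a single byte. The sketch below assumes the mbstring extension and uses 0x96 (decimal 150) as the example value: read as Windows-1252 it is an en dash (U+2013), read as ISO 8859-1 it is the C1 control character U+0096.

```php
<?php
$byte = "\x96"; // the number from a reference such as &#150;

// Interpretation browsers effectively use: Windows-1252.
$asWindows1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

// Interpretation implied by the code point itself: ISO 8859-1 / Unicode.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asWindows1252), "\n"; // e28093 (U+2013, en dash)
echo bin2hex($asLatin1), "\n";      // c296   (U+0096, control character)
```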

Now here’s an extension to the function mentioned above that handles this quirk:

function html_character_reference_decode($string, $encoding='UTF-8', $fixMappingBug=true) {
    $deref = function($match) use ($encoding, $fixMappingBug) {
        if (strtolower($match[1][0]) === "x") {
            $codepoint = intval(substr($match[1], 1), 16);
        } else {
            $codepoint = intval($match[1], 10);
        }
        // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
        if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
            $mapping = array(
                8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
                338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
                8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
            $codepoint = $mapping[$codepoint-130];
        }
        return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
    };
    return preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', $deref, $string);
}

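If anonymous functions are not available (they were introduced in PHP 5.3.0), you could also use create_function. Note that create_function was deprecated in PHP 7.2 and removed in PHP 8.0, so this only applies to old installations: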

$deref = create_function('$match', '
    $encoding = '.var_export($encoding, true).';
    $fixMappingBug = '.var_export($fixMappingBug, true).';
    if (strtolower($match[1][0]) === "x") {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
    if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
        $mapping = array(
            8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
            338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
            8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
        $codepoint = $mapping[$codepoint-130];
    }
    return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
');


Here’s another function that tries to comply with the behavior of HTML 5:

function html5_decode($string, $flags=ENT_COMPAT, $charset='UTF-8') {
    $deref = function($match) use ($flags, $charset) {
        if ($match[1][0] === '#') {
            // $match[1][0] is '#'; an 'x' right after it marks hexadecimal notation
            if (strtolower($match[1][1]) === 'x') {
                $codepoint = intval(substr($match[1], 2), 16);
            } else {
                $codepoint = intval(substr($match[1], 1), 10);
            }

            // HTML 5 specific behavior
            // @see http://dev.w3.org/html5/spec/tokenization.html#tokenizing-character-references

            // handle Windows-1252 mismapping
            // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
            // @see http://dev.w3.org/html5/spec/tokenization.html#table-charref-overrides
            $overrides = array(
                0x00=>0xFFFD,0x80=>0x20AC,0x82=>0x201A,0x83=>0x0192,0x84=>0x201E,
                0x85=>0x2026,0x86=>0x2020,0x87=>0x2021,0x88=>0x02C6,0x89=>0x2030,
                0x8A=>0x0160,0x8B=>0x2039,0x8C=>0x0152,0x8E=>0x017D,0x91=>0x2018,
                0x92=>0x2019,0x93=>0x201C,0x94=>0x201D,0x95=>0x2022,0x96=>0x2013,
                0x97=>0x2014,0x98=>0x02DC,0x99=>0x2122,0x9A=>0x0161,0x9B=>0x203A,
                0x9C=>0x0153,0x9E=>0x017E,0x9F=>0x0178);
            if (isset($overrides[$codepoint])) {
                $codepoint = $overrides[$codepoint];
            }

            if (($codepoint >= 0xD800 && $codepoint <= 0xDFFF) || $codepoint > 0x10FFFF) {
                $codepoint = 0xFFFD;
            }
            if (($codepoint >= 0x0001 && $codepoint <= 0x0008) ||
                ($codepoint >= 0x000E && $codepoint <= 0x001F) ||
                ($codepoint >= 0x007F && $codepoint <= 0x009F) ||
                ($codepoint >= 0xFDD0 && $codepoint <= 0xFDEF) ||
                in_array($codepoint, array(
                    0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
                    0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
                    0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
                    0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
                    0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, 0x10FFFF))) {
                $codepoint = 0xFFFD;
            }
            return mb_convert_encoding(pack("N", $codepoint), $charset, "UTF-32BE");
        } else {
            return html_entity_decode($match[0], $flags, $charset);
        }   
    };
    return preg_replace_callback('/&(#(?:x[0-9a-f]+|[0-9]+)|[A-Za-z0-9]+);/i', $deref, $string);
}

I’ve also noticed that in PHP 5.4.0 the html_entity_decode function gained another flag named ENT_HTML5 for HTML 5 behavior.
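
A minimal sketch of using that flag, assuming PHP 5.4 or later. The combination with ENT_QUOTES is a choice made here for illustration so that &apos; is decoded as well.

```php
<?php
// ENT_HTML5 selects the HTML 5 entity table; ENT_QUOTES also decodes quotes.
$flags = ENT_QUOTES | ENT_HTML5;

$hebrew = html_entity_decode('&#x5E0;&#x5D5;&#x5E3;', $flags, 'UTF-8');
$apos   = html_entity_decode('&apos;', $flags, 'UTF-8');

echo $hebrew, "\n"; // the three Hebrew letters as UTF-8
echo $apos, "\n";   // '
```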
