Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored


Problem description


I would like to replace invalid UTF-8 chars with question marks (PHP 5.3.5).

So far I have this solution, but invalid characters are removed, instead of being replaced by '?'.

function replace_invalid_utf8($str)
{
  return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

echo mb_substitute_character()."\n";

echo replace_invalid_utf8('éééaaaàààeeé')."\n";
echo replace_invalid_utf8('eeeaaaaaaeeé')."\n";

Should output:

63 // ASCII code for '?' character
???aaa???eé // or ??aa??eé
eeeaaaaaaeeé

But currently outputs:

63
aaaee // removed invalid characters
eeeaaaaaaeeé

Any advice?

Would you do it another way (using preg_replace(), for example)?

Thanks.

Solution

You can use mb_convert_encoding(), or htmlspecialchars() with the ENT_SUBSTITUTE option (available since PHP 5.4). Of course, you can use preg_match() too. If you use intl, you can use UConverter since PHP 5.5.

The recommended substitute character for an invalid byte sequence is U+FFFD. See "3.1.2 Substituting for Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations for details.

When using mb_convert_encoding(), you can specify the substitute character by passing a Unicode code point to mb_substitute_character() or by setting the mbstring.substitute_character directive. The default substitute character is ? (QUESTION MARK, U+003F).

// REPLACEMENT CHARACTER (U+FFFD)
mb_substitute_character(0xFFFD);

function replace_invalid_byte_sequence($str)
{
    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
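
For the question-mark substitution asked about in the question, the following is a minimal sketch. It assumes a PHP version in which mb_convert_encoding() honours the substitute character (the PHP 5.3.5 behaviour described in the question differs). The helper name replace_invalid_utf8_with_question_mark() is hypothetical and not part of the original answer. How many question marks are emitted for a multi-byte invalid sequence may vary between PHP versions.

```php
<?php
// Hypothetical helper (not from the original answer): replace invalid byte
// sequences with '?' and restore the previous mbstring setting afterwards.
function replace_invalid_utf8_with_question_mark($str)
{
    // The getter returns an int code point, or a string such as "none" or "long".
    $previous = mb_substitute_character();

    // U+003F QUESTION MARK
    mb_substitute_character(0x3F);
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');

    // Put the global setting back so other code is not affected.
    mb_substitute_character($previous);

    return $clean;
}

// The byte 0xFF can never appear in well-formed UTF-8.
echo replace_invalid_utf8_with_question_mark("abc\xFFdef"), PHP_EOL;
```

Saving and restoring the setting matters because mb_substitute_character() changes global state for the rest of the request.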

UConverter offers both a procedural and an object-oriented API.

function replace_invalid_byte_sequence3($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence4($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}

When using preg_match(), you need to pay attention to the byte ranges in order to avoid the UTF-8 non-shortest form vulnerability. The range of trail bytes changes depending on the lead byte.

lead byte: 0x00 - 0x7F, 0xC2 - 0xF4
trail byte: 0x80(or 0x90 or 0xA0) - 0xBF(or 0x8F)

You can refer to the following resources to check the byte ranges.

  1. "Syntax of UTF-8 Byte Sequences" in RFC 3629
  2. "Table 3-7. Well-Formed UTF-8 Byte Sequences" in the Unicode Standard 6.1
  3. "Multilingual form encoding" in W3C Internationalization

The byte range table is shown below.

      Code Points    First Byte Second Byte Third Byte Fourth Byte
  U+0000 -   U+007F   00 - 7F
  U+0080 -   U+07FF   C2 - DF    80 - BF
  U+0800 -   U+0FFF   E0         A0 - BF     80 - BF
  U+1000 -   U+CFFF   E1 - EC    80 - BF     80 - BF
  U+D000 -   U+D7FF   ED         80 - 9F     80 - BF
  U+E000 -   U+FFFF   EE - EF    80 - BF     80 - BF
 U+10000 -  U+3FFFF   F0         90 - BF     80 - BF    80 - BF
 U+40000 -  U+FFFFF   F1 - F3    80 - BF     80 - BF    80 - BF
U+100000 - U+10FFFF   F4         80 - 8F     80 - BF    80 - BF
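
As a quick way to see why the ranges in the table matter, here is a small sketch that feeds byte strings just outside those ranges to mb_check_encoding(). The helper name is_well_formed_utf8() and the sample labels are illustrative assumptions; the results are what I would expect on current PHP versions, and older releases may have been more permissive.

```php
<?php
// Hypothetical wrapper (not from the original answer) around mb_check_encoding().
function is_well_formed_utf8($bytes)
{
    return mb_check_encoding($bytes, 'UTF-8');
}

// Byte strings chosen to sit just outside the ranges in the table,
// plus one well-formed sequence for comparison.
$samples = [
    'non-shortest form of "/" (C0 AF)'    => "\xC0\xAF",
    'non-shortest form (E0 80 80)'        => "\xE0\x80\x80",
    'surrogate U+D800 (ED A0 80)'         => "\xED\xA0\x80",
    'beyond U+10FFFF (F4 90 80 80)'       => "\xF4\x90\x80\x80",
    'well-formed U+20AC (E2 82 AC)'       => "\xE2\x82\xAC",
];

foreach ($samples as $label => $bytes) {
    echo $label, ': ', var_export(is_well_formed_utf8($bytes), true), PHP_EOL;
}
```

A hand-written regular expression that used [\x80-\xBF] for every trail byte would accept the first four samples, which is the mistake the table helps to avoid.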

How to replace invalid byte sequences without breaking valid characters is shown in "3.1.1 Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations and in "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in The Unicode Standard.

The Unicode Standard shows an example:

before: <61    F1 80 80  E1 80  C2    62    80    63    80    BF    64  >
after:  <0061  FFFD      FFFD   FFFD  0062  FFFD  0063  FFFD  FFFD  0064>
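
To compare any replacement function against this example, it can help to have the "after" row as a UTF-8 byte string. The sketch below builds it from the code points listed above, assuming mb_chr() is available (PHP 7.2 or later); the variable names are illustrative only.

```php
<?php
// Input bytes from the "before" row of the example.
$input = "\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64";

// Code points from the "after" row of the example.
$code_points = [0x61, 0xFFFD, 0xFFFD, 0xFFFD, 0x62, 0xFFFD, 0x63, 0xFFFD, 0xFFFD, 0x64];

// Encode each code point as UTF-8 and join the pieces.
$expected = implode('', array_map(function ($code_point) {
    return mb_chr($code_point, 'UTF-8');
}, $code_points));

echo 'input:    ', bin2hex($input), PHP_EOL;
echo 'expected: ', bin2hex($expected), PHP_EOL;
```

A function that follows the rule from the Unicode Standard should turn $input into exactly $expected; other strategies may emit a different number of U+FFFD characters.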

Here is an implementation using preg_replace_callback() that follows the above rule.

function replace_invalid_byte_sequence5($str)
{
    // REPLACEMENT CHARACTER (U+FFFD)
    $substitute = "\xEF\xBF\xBD";
    $regex = '/
      ([\x00-\x7F]                       #   U+0000 -   U+007F
      |[\xC2-\xDF][\x80-\xBF]            #   U+0080 -   U+07FF
      | \xE0[\xA0-\xBF][\x80-\xBF]       #   U+0800 -   U+0FFF
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} #   U+1000 -   U+CFFF
      | \xED[\x80-\x9F][\x80-\xBF]       #   U+D000 -   U+D7FF
      | \xF0[\x90-\xBF][\x80-\xBF]{2}    #  U+10000 -  U+3FFFF
      |[\xF1-\xF3][\x80-\xBF]{3}         #  U+40000 -  U+FFFFF
      | \xF4[\x80-\x8F][\x80-\xBF]{2})   # U+100000 - U+10FFFF
      |(\xE0[\xA0-\xBF]                  #   U+0800 -   U+0FFF (invalid)
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]    #   U+1000 -   U+CFFF (invalid)
      | \xED[\x80-\x9F]                  #   U+D000 -   U+D7FF (invalid)
      | \xF0[\x90-\xBF][\x80-\xBF]?      #  U+10000 -  U+3FFFF (invalid)
      |[\xF1-\xF3][\x80-\xBF]{1,2}       #  U+40000 -  U+FFFFF (invalid)
      | \xF4[\x80-\x8F][\x80-\xBF]?)     # U+100000 - U+10FFFF (invalid)
      |(.)                               # invalid 1-byte
    /xs';

    // $matches[1]: valid character
    // $matches[2]: invalid 3-byte or 4-byte character
    // $matches[3]: invalid 1-byte

    $ret = preg_replace_callback($regex, function($matches) use($substitute) {

        if (isset($matches[2]) || isset($matches[3])) {

            return $substitute;

        }

        return $matches[1];

    }, $str);

    return $ret;
}

You can also compare bytes directly, which avoids preg_match()'s restriction on byte size.

function replace_invalid_byte_sequence6($str) {

    $size = strlen($str);
    $substitute = "\xEF\xBF\xBD";
    $ret = '';

    $pos = 0;
    // Filled in by reference on each call to utf8_get_next_char().
    $char = null;
    $char_size = 0;
    $valid = false;

    while (utf8_get_next_char($str, $size, $pos, $char, $char_size, $valid)) {
        $ret .= $valid ? $char : $substitute;
    }

    return $ret;
}

function utf8_get_next_char($str, $str_size, &$pos, &$char, &$char_size, &$valid)
{
    $valid = false;

    if ($str_size <= $pos) {
        return false;
    }

    if ($str[$pos] < "\x80") {

        $valid = true;
        $char_size =  1;

    } else if ($str[$pos] < "\xC2") {

        $char_size = 1;

    } else if ($str[$pos] < "\xE0")  {

        if (!isset($str[$pos+1]) || $str[$pos+1] < "\x80" || "\xBF" < $str[$pos+1]) {

            $char_size = 1;

        } else {

            $valid = true;
            $char_size = 2;

        }

    } else if ($str[$pos] < "\xF0") {

        $left = "\xE0" === $str[$pos] ? "\xA0" : "\x80";
        $right = "\xED" === $str[$pos] ? "\x9F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {

            $char_size = 1;

        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {

            $char_size = 2;

        } else {

            $valid = true;
            $char_size = 3;

        }

    } else if ($str[$pos] < "\xF5") {

        $left = "\xF0" === $str[$pos] ? "\x90" : "\x80";
        $right = "\xF4" === $str[$pos] ? "\x8F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {

            $char_size = 1;

        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {

            $char_size = 2;

        } else if (!isset($str[$pos+3]) || $str[$pos+3] < "\x80" || "\xBF" < $str[$pos+3]) {

            $char_size = 3;

        } else {

            $valid = true;
            $char_size = 4;

        }

    } else {

        $char_size = 1;

    }

    $char = substr($str, $pos, $char_size);
    $pos += $char_size;

    return true;
}

The test case is here.

function run(array $callables, array $arguments)
{
    return array_map(function($callable) use($arguments) {
         return array_map($callable, $arguments);
    }, $callables);
}

$data = [
    // Table 3-8. Use of U+FFFD in UTF-8 Conversion
    // http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
    "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"
    ."\x80"."\xBF"."\x64",

    // 'FULL MOON SYMBOL' (U+1F315) and invalid byte sequence
    "\xF0\x9F\x8C\x95"."\xF0\x9F\x8C"."\xF0\x9F\x8C"
];

var_dump(run([
    'replace_invalid_byte_sequence', 
    'replace_invalid_byte_sequence2',
    'replace_invalid_byte_sequence3',
    'replace_invalid_byte_sequence4',
    'replace_invalid_byte_sequence5',
    'replace_invalid_byte_sequence6'
], $data));

As a note, mb_convert_encoding() has a bug: it breaks a valid character that comes just after an invalid byte sequence, or removes an invalid byte sequence that follows valid characters without adding U+FFFD.

$data = [
    // U+20AC
    "\xE2\x82\xAC"."\xE2\x82\xAC"."\xE2\x82\xAC",
    "\xE2\x82"    ."\xE2\x82\xAC"."\xE2\x82\xAC",

    // U+24B62
    "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xF0\xA4\xAD"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",

    // 'FULL MOON SYMBOL' (U+1F315)
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C",
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C" . "\xF0\x9F\x8C"
];

Although preg_match() can be used instead of preg_replace_callback(), this function has a limitation on byte size. See bug report #36463 for details. You can confirm it with the following test case.

str_repeat('a', 10000)
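
Whatever the cause, the PCRE functions report failure by returning null rather than throwing, so a silent failure can look like an empty result. The following is a minimal sketch of a guard around preg_replace_callback(); the wrapper name preg_replace_callback_or_fail() is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical wrapper: turn a null result from preg_replace_callback()
// into an exception that carries the PCRE error code.
function preg_replace_callback_or_fail($regex, callable $callback, $subject)
{
    $result = preg_replace_callback($regex, $callback, $subject);

    if ($result === null) {
        throw new RuntimeException(
            'preg_replace_callback() failed, preg_last_error() = ' . preg_last_error()
        );
    }

    return $result;
}

// Normal use: behaves like preg_replace_callback().
echo preg_replace_callback_or_fail('/b/', function ($matches) {
    return 'X';
}, 'abc'), PHP_EOL;
```

One easy way to see the failure path is to use the /u modifier on a subject that is not valid UTF-8, which is also why the pattern in replace_invalid_byte_sequence5() is written without /u.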

Finally, my benchmark results are as follows.

mb_convert_encoding()
0.19628190994263
htmlspecialchars()
0.082863092422485
UConverter::transcode()
0.15999984741211
UConverter::convert()
0.29843020439148
preg_replace_callback()
0.63967490196228
direct comparison
0.71933102607727

The benchmark code is here.

function timer(array $callables, array $arguments, $repeat = 10000)
{

    $ret = [];
    $save = $repeat;

    foreach ($callables as $key => $callable) {

        $start = microtime(true);

        do {

            array_map($callable, $arguments);

        } while($repeat -= 1);

        $stop = microtime(true);
        $ret[$key] = $stop - $start;
        $repeat = $save;

    }

    return $ret;
}

$functions = [
    'mb_convert_encoding()' => 'replace_invalid_byte_sequence',
    'htmlspecialchars()' => 'replace_invalid_byte_sequence2',
    'UConverter::transcode()' => 'replace_invalid_byte_sequence3',
    'UConverter::convert()' => 'replace_invalid_byte_sequence4',
    'preg_replace_callback()' => 'replace_invalid_byte_sequence5',
    'direct comparison' => 'replace_invalid_byte_sequence6'
];

foreach (timer($functions, $data) as $description => $time) {

    echo $description, PHP_EOL,
         $time, PHP_EOL;

}
