Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored

Views: 287 | Published: 2016/11/19 12:49:14 | Tags: php, utf-8, character-encoding, mbstring

Question

I would like to replace invalid UTF-8 characters with question marks (PHP 5.3.5).

So far I have this solution, but invalid characters are removed, instead of being replaced by '?'.
function replace_invalid_utf8($str)
{
  return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

echo mb_substitute_character()."\n";

echo replace_invalid_utf8('éééaaaàààeeÃ©')."\n";
echo replace_invalid_utf8('eeeaaaaaaeeÃ©')."\n";
Should output:
63 // ASCII code for '?' character
???aaa???eé // or ??aa??eé
eeeaaaaaaeeé
But currently outputs:
63
aaaee // removed invalid characters
eeeaaaaaaeeé
Any advice?

Would you do it another way (using preg_replace(), for example)?

Thanks.
Solution
You can use mb_convert_encoding(), or the ENT_SUBSTITUTE option of htmlspecialchars(), which is available since PHP 5.4. Of course, you can use preg_match() too. If you use intl, you can use UConverter since PHP 5.5.

The recommended substitute character for an invalid byte sequence is U+FFFD. See "3.1.2 Substituting for Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations for details.

When using mb_convert_encoding(), you can specify the substitute character by passing a Unicode code point to mb_substitute_character() or by setting the mbstring.substitute_character directive. The default substitute character is ? (QUESTION MARK - U+003F).
// REPLACEMENT CHARACTER (U+FFFD)
mb_substitute_character(0xFFFD);

function replace_invalid_byte_sequence($str)
{
    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
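
The htmlspecialchars() round trip above can be tried in isolation. The following is a minimal sketch, not the answer's own code: the helper name substitute_via_htmlspecialchars is hypothetical, and adding ENT_QUOTES on both sides is an assumption made here so that encoding and decoding use matching quote handling.

```php
<?php
// Hypothetical helper for illustration; the flag combination is an assumption.
function substitute_via_htmlspecialchars(string $str): string
{
    // ENT_SUBSTITUTE turns ill-formed bytes into U+FFFD. Without it (or
    // ENT_IGNORE), htmlspecialchars() returns an empty string for such input.
    $encoded = htmlspecialchars($str, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Decode with the same quote flags so ordinary text survives unchanged.
    return htmlspecialchars_decode($encoded, ENT_QUOTES);
}

// A lone trail byte (0x80) between two ASCII letters.
$repaired = substitute_via_htmlspecialchars("a\x80b");

// Print the bytes as hex to make the substitution visible.
echo bin2hex($repaired), PHP_EOL;
```

Text that already contains entities, such as a literal "&amp;", should come back unchanged, because the encode step escapes the ampersand and the decode step undoes exactly that one level.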
UConverter offers both a procedural and an object-oriented API.
function replace_invalid_byte_sequence3($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence4($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}
When using preg_match(), you need to pay attention to the byte ranges in order to avoid the UTF-8 non-shortest-form vulnerability. The range of trail bytes changes depending on the range of the lead byte.
lead byte: 0x00 - 0x7F, 0xC2 - 0xF4
trail byte: 0x80(or 0x90 or 0xA0) - 0xBF(or 0x8F)
You can refer to the following resources to check the byte ranges:

"Syntax of UTF-8 Byte Sequences" in RFC 3629
"Table 3-7. Well-Formed UTF-8 Byte Sequences" in the Unicode Standard 6.1
"Multilingual form encoding" in W3C Internationalization
The byte range table is below.
      Code Points    First Byte Second Byte Third Byte Fourth Byte
  U+0000 -   U+007F   00 - 7F
  U+0080 -   U+07FF   C2 - DF    80 - BF
  U+0800 -   U+0FFF   E0         A0 - BF     80 - BF
  U+1000 -   U+CFFF   E1 - EC    80 - BF     80 - BF
  U+D000 -   U+D7FF   ED         80 - 9F     80 - BF
  U+E000 -   U+FFFF   EE - EF    80 - BF     80 - BF
 U+10000 -  U+3FFFF   F0         90 - BF     80 - BF    80 - BF
 U+40000 -  U+FFFFF   F1 - F3    80 - BF     80 - BF    80 - BF
U+100000 - U+10FFFF   F4         80 - 8F     80 - BF    80 - BF
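
The dependency of the second byte on the lead byte can be written down as a small lookup. This is only a sketch of the table above; the function name second_byte_range is hypothetical and does not appear in the original answer.

```php
<?php
/**
 * Hypothetical helper: returns [min, max] for the second byte that may follow
 * the given lead byte, or null when the byte is ASCII or not a valid lead byte.
 */
function second_byte_range(int $lead): ?array
{
    // 0x00-0x7F are single-byte characters; 0x80-0xC1 and 0xF5-0xFF
    // can never start a well-formed multi-byte sequence.
    if ($lead < 0xC2 || $lead > 0xF4) {
        return null;
    }

    switch ($lead) {
        case 0xE0: return [0xA0, 0xBF]; // excludes non-shortest 3-byte forms
        case 0xED: return [0x80, 0x9F]; // excludes surrogates U+D800-U+DFFF
        case 0xF0: return [0x90, 0xBF]; // excludes non-shortest 4-byte forms
        case 0xF4: return [0x80, 0x8F]; // excludes code points above U+10FFFF
        default:   return [0x80, 0xBF]; // every other lead byte
    }
}

// Example: which second bytes may follow the lead byte 0xE0?
list($min, $max) = second_byte_range(0xE0);
printf("%02X - %02X\n", $min, $max);
```

Third and fourth bytes are always in 80 - BF, so only the second byte needs this special handling.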
How to replace invalid byte sequences without breaking valid characters is shown in "3.1.1 Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations and in "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in the Unicode Standard.

The Unicode Standard shows an example:
before: <61    F1 80 80  E1 80  C2    62    80    63    80    BF    64  >
after:  <0061  FFFD      FFFD   FFFD  0062  FFFD  0063  FFFD  FFFD  0064>
Here is an implementation using preg_replace_callback() that follows the above rule.
function replace_invalid_byte_sequence5($str)
{
    // REPLACEMENT CHARACTER (U+FFFD)
    $substitute = "\xEF\xBF\xBD";
    $regex = '/
      ([\x00-\x7F]                       #   U+0000 -   U+007F
      |[\xC2-\xDF][\x80-\xBF]            #   U+0080 -   U+07FF
      | \xE0[\xA0-\xBF][\x80-\xBF]       #   U+0800 -   U+0FFF
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} #   U+1000 -   U+CFFF
      | \xED[\x80-\x9F][\x80-\xBF]       #   U+D000 -   U+D7FF
      | \xF0[\x90-\xBF][\x80-\xBF]{2}    #  U+10000 -  U+3FFFF
      |[\xF1-\xF3][\x80-\xBF]{3}         #  U+40000 -  U+FFFFF
      | \xF4[\x80-\x8F][\x80-\xBF]{2})   # U+100000 - U+10FFFF
      |(\xE0[\xA0-\xBF]                  #   U+0800 -   U+0FFF (invalid)
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]    #   U+1000 -   U+CFFF (invalid)
      | \xED[\x80-\x9F]                  #   U+D000 -   U+D7FF (invalid)
      | \xF0[\x90-\xBF][\x80-\xBF]?      #  U+10000 -  U+3FFFF (invalid)
      |[\xF1-\xF3][\x80-\xBF]{1,2}       #  U+40000 -  U+FFFFF (invalid)
      | \xF4[\x80-\x8F][\x80-\xBF]?)     # U+100000 - U+10FFFF (invalid)
      |(.)                               # invalid 1-byte
    /xs';

    // $matches[1]: valid character
    // $matches[2]: invalid 3-byte or 4-byte character
    // $matches[3]: invalid 1-byte

    $ret = preg_replace_callback($regex, function($matches) use($substitute) {

        if (isset($matches[2]) || isset($matches[3])) {

            return $substitute;

        }

        return $matches[1];

    }, $str);

    return $ret;
}
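
Before running a full replacement pass like the one above, it can be useful to check whether a string is well-formed at all. A minimal sketch, assuming the PCRE extension is available (it is part of a default PHP build); the function name is_well_formed_utf8 is hypothetical.

```php
<?php
// Hypothetical helper for illustration.
function is_well_formed_utf8(string $str): bool
{
    // With the /u modifier, PCRE validates the subject as UTF-8 before
    // matching. The empty pattern matches any valid subject (result 1);
    // for ill-formed input preg_match() returns false.
    return preg_match('//u', $str) === 1;
}

var_dump(is_well_formed_utf8("\xE2\x82\xAC")); // U+20AC, complete
var_dump(is_well_formed_utf8("\xE2\x82"));     // U+20AC, truncated
```

Strings that pass this check can be returned unchanged, which skips the slower replacement code for the common case of clean input.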
You can also compare bytes directly, which avoids preg_match's restriction on byte size.
function replace_invalid_byte_sequence6($str) {

    $size = strlen($str);
    $substitute = "\xEF\xBF\xBD";
    $ret = '';

    $pos = 0;
    // Initialize the variables that are passed by reference below.
    $char = null;
    $char_size = 0;
    $valid = false;

    while (utf8_get_next_char($str, $size, $pos, $char, $char_size, $valid)) {
        $ret .= $valid ? $char : $substitute;
    }

    return $ret;
}

function utf8_get_next_char($str, $str_size, &$pos, &$char, &$char_size, &$valid)
{
    $valid = false;

    if ($str_size <= $pos) {
        return false;
    }

    if ($str[$pos] < "\x80") {

        $valid = true;
        $char_size =  1;

    } else if ($str[$pos] < "\xC2") {

        $char_size = 1;

    } else if ($str[$pos] < "\xE0")  {

        if (!isset($str[$pos+1]) || $str[$pos+1] < "\x80" || "\xBF" < $str[$pos+1]) {

            $char_size = 1;

        } else {

            $valid = true;
            $char_size = 2;

        }

    } else if ($str[$pos] < "\xF0") {

        $left = "\xE0" === $str[$pos] ? "\xA0" : "\x80";
        $right = "\xED" === $str[$pos] ? "\x9F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {

            $char_size = 1;

        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {

            $char_size = 2;

        } else {

            $valid = true;
            $char_size = 3;

       }

    } else if ($str[$pos] < "\xF5") {

        $left = "\xF0" === $str[$pos] ? "\x90" : "\x80";
        $right = "\xF4" === $str[$pos] ? "\x8F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {

            $char_size = 1;

        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {

            $char_size = 2;

        } else if (!isset($str[$pos+3]) || $str[$pos+3] < "\x80" || "\xBF" < $str[$pos+3]) {

            $char_size = 3;

        } else {

            $valid = true;
            $char_size = 4;

        }

    } else {

        $char_size = 1;

    }

    $char = substr($str, $pos, $char_size);
    $pos += $char_size;

    return true;
}
The test case is here.
function run(array $callables, array $arguments)
{
    return array_map(function($callable) use($arguments) {
         return array_map($callable, $arguments);
    }, $callables);
}

$data = [
    // Table 3-8. Use of U+FFFD in UTF-8 Conversion
    // http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
    "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"
    ."\x80"."\xBF"."\x64",

    // 'FULL MOON SYMBOL' (U+1F315) and invalid byte sequence
    "\xF0\x9F\x8C\x95"."\xF0\x9F\x8C"."\xF0\x9F\x8C"
];

var_dump(run([
    'replace_invalid_byte_sequence', 
    'replace_invalid_byte_sequence2',
    'replace_invalid_byte_sequence3',
    'replace_invalid_byte_sequence4',
    'replace_invalid_byte_sequence5',
    'replace_invalid_byte_sequence6'
], $data));
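
Because var_dump() prints the raw bytes, the substituted characters can be hard to see in a terminal. A small hex view makes it easier to compare results with the "after" row of the Unicode Standard example. This is a sketch only; the helper name to_hex is hypothetical.

```php
<?php
// Hypothetical helper: renders a byte string as space-separated hex pairs.
function to_hex(string $bytes): string
{
    // bin2hex() yields two lowercase hex digits per byte;
    // str_split() regroups them into one pair per byte.
    return implode(' ', str_split(strtoupper(bin2hex($bytes)), 2));
}

// 'a' followed by U+FFFD encoded as UTF-8.
echo to_hex("\x61\xEF\xBF\xBD"), PHP_EOL; // prints "61 EF BF BD"
```

Mapping to_hex over each function's output shows directly where EF BF BD (U+FFFD) was inserted.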
Note that mb_convert_encoding() has a bug: it can break a valid character that immediately follows an invalid byte sequence, or remove an invalid byte sequence that follows valid characters without adding U+FFFD.
$data = [
    // U+20AC
    "\xE2\x82\xAC"."\xE2\x82\xAC"."\xE2\x82\xAC",
    "\xE2\x82"    ."\xE2\x82\xAC"."\xE2\x82\xAC",

    // U+24B62
    "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xF0\xA4\xAD"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",

    // 'FULL MOON SYMBOL' (U+1F315)
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C",
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C" . "\xF0\x9F\x8C"
];
Although preg_match() can be used instead of preg_replace_callback(), it has a limitation on byte size. See bug report #36463 for details. You can confirm it with the following test case.
str_repeat('a', 10000)
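
Whether such a limit is actually hit depends on the PHP version and PCRE settings, so it helps to detect a failure rather than assume one. The preg_* replace functions return null on error, and preg_last_error() reports the error code. A minimal sketch under that assumption; the wrapper name checked_preg_replace_callback is hypothetical.

```php
<?php
// Hypothetical wrapper for illustration; it only adds error detection.
function checked_preg_replace_callback(string $regex, callable $callback, string $subject): string
{
    $result = preg_replace_callback($regex, $callback, $subject);

    // null signals an error, for example when a backtrack or recursion
    // limit is exceeded or the subject is rejected by PCRE.
    if ($result === null) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }

    return $result;
}

// A long subject with a trivial pattern, just to exercise the wrapper.
$out = checked_preg_replace_callback('/a/', function ($m) {
    return 'b';
}, str_repeat('a', 10000));

echo strlen($out), PHP_EOL;
```

Silently continuing with a null result would turn the whole string into an empty value, so failing loudly is the safer default here.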
Finally, the results of my benchmark are as follows (in seconds).
mb_convert_encoding()
0.19628190994263
htmlspecialchars()
0.082863092422485
UConverter::transcode()
0.15999984741211
UConverter::convert()
0.29843020439148
preg_replace_callback()
0.63967490196228
direct comparison
0.71933102607727
The benchmark code is here.
function timer(array $callables, array $arguments, $repeat = 10000)
{

    $ret = [];
    $save = $repeat;

    foreach ($callables as $key => $callable) {

        $start = microtime(true);

        do {

            array_map($callable, $arguments);

        } while($repeat -= 1);

        $stop = microtime(true);
        $ret[$key] = $stop - $start;
        $repeat = $save;

    }

    return $ret;
}

$functions = [
    'mb_convert_encoding()' => 'replace_invalid_byte_sequence',
    'htmlspecialchars()' => 'replace_invalid_byte_sequence2',
    'UConverter::transcode()' => 'replace_invalid_byte_sequence3',
    'UConverter::convert()' => 'replace_invalid_byte_sequence4',
    'preg_replace_callback()' => 'replace_invalid_byte_sequence5',
    'direct comparison' => 'replace_invalid_byte_sequence6'
];

foreach (timer($functions, $data) as $description => $time) {

    echo $description, PHP_EOL,
         $time, PHP_EOL;

}
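
On newer PHP versions the same measurement can be taken with a monotonic clock. The sketch below is an alternative to the timer above and is not the code that produced the numbers in this answer; it assumes PHP 7.3 or later for hrtime(), and the function name time_callable is hypothetical.

```php
<?php
// Hypothetical helper; requires PHP 7.3+ for hrtime().
function time_callable(callable $fn, array $arguments, int $repeat = 1000): float
{
    // hrtime(true) returns nanoseconds from a monotonic clock, so the
    // measurement is not affected by system clock adjustments.
    $start = hrtime(true);

    for ($i = 0; $i < $repeat; $i++) {
        array_map($fn, $arguments);
    }

    // Convert nanoseconds to seconds.
    return (hrtime(true) - $start) / 1e9;
}

// Example with a built-in function standing in for the replacement functions.
echo time_callable('strtoupper', ['abc', 'def'], 1000), PHP_EOL;
```

Absolute timings vary by machine and PHP build, so only the relative ordering of the candidates is meaningful.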


                        