Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored
Problem description
I would like to replace invalid UTF-8 characters with question marks (PHP 5.3.5).
So far I have this solution, but invalid characters are removed, instead of being replaced by '?'.
    function replace_invalid_utf8($str)
    {
        return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    }

    echo mb_substitute_character()."\n";
    echo replace_invalid_utf8('éééaaaàààeeé')."\n";
    echo replace_invalid_utf8('eeeaaaaaaeeé')."\n";
Should output:
    63               // ASCII code for '?' character
    ???aaa???eé      // or ??aa??eé
    eeeaaaaaaeeé
But currently outputs:
    63
    aaaee            // removed invalid characters
    eeeaaaaaaeeé
Any advice? Would you do it another way (using preg_replace(), for example)? Thanks.
Solution
You can use mb_convert_encoding(), or htmlspecialchars() with the ENT_SUBSTITUTE option (available since PHP 5.4). Of course, you can use preg_match() too. If you use intl, you can use UConverter since PHP 5.5.
The recommended substitute character for an invalid byte sequence is U+FFFD. See "3.1.2 Substituting for Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations for details.
When using mb_convert_encoding(), you can specify the substitute character by passing a Unicode code point to mb_substitute_character() or by setting the mbstring.substitute_character directive. The default substitute character is ? (QUESTION MARK, U+003F).
    // REPLACEMENT CHARACTER (U+FFFD)
    mb_substitute_character(0xFFFD);

    function replace_invalid_byte_sequence($str)
    {
        return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    }

    function replace_invalid_byte_sequence2($str)
    {
        return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
    }
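As a quick sanity check that the setting took effect, the current substitute character can be read back by calling mb_substitute_character() without arguments. The following is only a minimal sketch and assumes the mbstring extension is loaded; the variable names are illustrative.

```php
<?php
// Remember the current setting (63, i.e. '?', unless configured otherwise).
$previous = mb_substitute_character();

// Switch to U+FFFD REPLACEMENT CHARACTER.
mb_substitute_character(0xFFFD);

// Called without arguments, the function returns the current setting;
// for a code point this is an integer.
$current = mb_substitute_character();
printf("U+%04X\n", $current);

// Restore the earlier setting so other code is not affected.
mb_substitute_character($previous);
```

Restoring the previous value matters because the setting is global to the request.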
UConverter offers both a procedural and an object-oriented API.
    function replace_invalid_byte_sequence3($str)
    {
        return UConverter::transcode($str, 'UTF-8', 'UTF-8');
    }

    function replace_invalid_byte_sequence4($str)
    {
        return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
    }
When using preg_match(), you need to pay attention to the byte ranges in order to avoid the UTF-8 non-shortest-form vulnerability. The range of trail bytes changes depending on the lead byte.
    lead byte:  0x00 - 0x7F, 0xC2 - 0xF4
    trail byte: 0x80 (or 0x90 or 0xA0) - 0xBF (or 0x8F)
You can refer to the following resources to check the byte ranges.
- "Syntax of UTF-8 Byte Sequences" in RFC 3629
- "Table 3-7. Well-Formed UTF-8 Byte Sequences" in the Unicode Standard 6.1
- "Multilingual form encoding" in W3C Internationalization
The byte range table is below.
    Code Points           First Byte   Second Byte   Third Byte   Fourth Byte
    U+0000   - U+007F     00 - 7F
    U+0080   - U+07FF     C2 - DF      80 - BF
    U+0800   - U+0FFF     E0           A0 - BF       80 - BF
    U+1000   - U+CFFF     E1 - EC      80 - BF       80 - BF
    U+D000   - U+D7FF     ED           80 - 9F       80 - BF
    U+E000   - U+FFFF     EE - EF      80 - BF       80 - BF
    U+10000  - U+3FFFF    F0           90 - BF       80 - BF      80 - BF
    U+40000  - U+FFFFF    F1 - F3      80 - BF       80 - BF      80 - BF
    U+100000 - U+10FFFF   F4           80 - 8F       80 - BF      80 - BF
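The table can be read as a lookup from the lead byte to the allowed range of the second byte. Here is a minimal sketch of that lookup; utf8_second_byte_range() is a hypothetical helper introduced only for illustration and is not part of the answer's code.

```php
<?php
// Hypothetical helper: returns [min, max] for the second byte that may
// follow the given lead byte, or null if the byte cannot start a
// multi-byte sequence (ASCII, stray trail bytes, C0, C1, F5 - FF).
function utf8_second_byte_range($lead)
{
    $b = ord($lead); // ord() looks at the first byte only

    if ($b >= 0xC2 && $b <= 0xDF) { return [0x80, 0xBF]; }
    if ($b === 0xE0)              { return [0xA0, 0xBF]; } // excludes non-shortest forms
    if ($b === 0xED)              { return [0x80, 0x9F]; } // excludes surrogates
    if ($b >= 0xE1 && $b <= 0xEF) { return [0x80, 0xBF]; }
    if ($b === 0xF0)              { return [0x90, 0xBF]; } // excludes non-shortest forms
    if ($b >= 0xF1 && $b <= 0xF3) { return [0x80, 0xBF]; }
    if ($b === 0xF4)              { return [0x80, 0x8F]; } // stays within U+10FFFF

    return null;
}

var_dump(utf8_second_byte_range("\xE0"));
var_dump(utf8_second_byte_range("\xC1"));
```

The third and fourth bytes need no lookup, since they are always in 80 - BF.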
How to replace an invalid byte sequence without breaking valid characters is described in "3.1.1 Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations and in "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in The Unicode Standard.
The Unicode Standard shows an example:
    before: <61 F1 80 80 E1 80 C2 62 80 63 80 BF 64>
    after:  <0061 FFFD FFFD FFFD 0062 FFFD 0063 FFFD FFFD 0064>
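To compare a candidate function against this example, it helps to have the input and the expected output as PHP strings and to dump them as hex. This is a rough sketch; hex_bytes() is a hypothetical helper introduced here, and $before and $after simply transcribe the two sequences above, with U+FFFD written as its UTF-8 bytes EF BF BD.

```php
<?php
// Hypothetical helper: formats a byte string as space-separated
// uppercase hex pairs, e.g. "61 F1 80".
function hex_bytes($str)
{
    return strtoupper(implode(' ', str_split(bin2hex($str), 2)));
}

$fffd = "\xEF\xBF\xBD"; // U+FFFD encoded as UTF-8

// Input bytes from the example.
$before = "\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64";

// Expected result: a, three U+FFFD, b, one U+FFFD, c, two U+FFFD, d.
$after = 'a' . str_repeat($fffd, 3) . 'b' . $fffd . 'c' . str_repeat($fffd, 2) . 'd';

echo hex_bytes($before), PHP_EOL;
echo hex_bytes($after), PHP_EOL;
```

A replacement function follows the rule in the table if it maps $before to exactly $after.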
Here is an implementation using preg_replace_callback() that follows the above rule.
    function replace_invalid_byte_sequence5($str)
    {
        // REPLACEMENT CHARACTER (U+FFFD)
        $substitute = "\xEF\xBF\xBD";
        $regex = '/
          ([\x00-\x7F]                       #   U+0000 -   U+007F
          |[\xC2-\xDF][\x80-\xBF]            #   U+0080 -   U+07FF
          | \xE0[\xA0-\xBF][\x80-\xBF]       #   U+0800 -   U+0FFF
          |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} #   U+1000 -   U+CFFF
          | \xED[\x80-\x9F][\x80-\xBF]       #   U+D000 -   U+D7FF
          | \xF0[\x90-\xBF][\x80-\xBF]{2}    #  U+10000 -  U+3FFFF
          |[\xF1-\xF3][\x80-\xBF]{3}         #  U+40000 -  U+FFFFF
          | \xF4[\x80-\x8F][\x80-\xBF]{2})   # U+100000 - U+10FFFF
          |(\xE0[\xA0-\xBF]                  #   U+0800 -   U+0FFF (invalid)
          |[\xE1-\xEC\xEE\xEF][\x80-\xBF]    #   U+1000 -   U+CFFF (invalid)
          | \xED[\x80-\x9F]                  #   U+D000 -   U+D7FF (invalid)
          | \xF0[\x90-\xBF][\x80-\xBF]?      #  U+10000 -  U+3FFFF (invalid)
          |[\xF1-\xF3][\x80-\xBF]{1,2}       #  U+40000 -  U+FFFFF (invalid)
          | \xF4[\x80-\x8F][\x80-\xBF]?)     # U+100000 - U+10FFFF (invalid)
          |(.)                               # invalid 1-byte
        /xs';

        // $matches[1]: valid character
        // $matches[2]: invalid 3-byte or 4-byte character
        // $matches[3]: invalid 1-byte
        $ret = preg_replace_callback($regex, function($matches) use($substitute) {
            if (isset($matches[2]) || isset($matches[3])) {
                return $substitute;
            }
            return $matches[1];
        }, $str);

        return $ret;
    }
You can also compare bytes directly, which avoids preg_match's restriction on byte size.
    function replace_invalid_byte_sequence6($str)
    {
        $size = strlen($str);
        $substitute = "\xEF\xBF\xBD";
        $ret = '';

        // Filled in by reference on every call to utf8_get_next_char().
        $pos = 0;
        $char = '';
        $char_size = 0;
        $valid = false;

        while (utf8_get_next_char($str, $size, $pos, $char, $char_size, $valid)) {
            $ret .= $valid ? $char : $substitute;
        }

        return $ret;
    }

    function utf8_get_next_char($str, $str_size, &$pos, &$char, &$char_size, &$valid)
    {
        $valid = false;

        if ($str_size <= $pos) {
            return false;
        }

        if ($str[$pos] < "\x80") {
            // ASCII
            $valid = true;
            $char_size = 1;
        } else if ($str[$pos] < "\xC2") {
            // Stray trail byte, or C0/C1 (non-shortest form)
            $char_size = 1;
        } else if ($str[$pos] < "\xE0") {
            // 2-byte sequence
            if (!isset($str[$pos+1]) || $str[$pos+1] < "\x80" || "\xBF" < $str[$pos+1]) {
                $char_size = 1;
            } else {
                $valid = true;
                $char_size = 2;
            }
        } else if ($str[$pos] < "\xF0") {
            // 3-byte sequence; the second-byte range depends on the lead byte
            $left = "\xE0" === $str[$pos] ? "\xA0" : "\x80";
            $right = "\xED" === $str[$pos] ? "\x9F" : "\xBF";

            if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
                $char_size = 1;
            } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
                $char_size = 2;
            } else {
                $valid = true;
                $char_size = 3;
            }
        } else if ($str[$pos] < "\xF5") {
            // 4-byte sequence; the second-byte range depends on the lead byte
            $left = "\xF0" === $str[$pos] ? "\x90" : "\x80";
            $right = "\xF4" === $str[$pos] ? "\x8F" : "\xBF";

            if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
                $char_size = 1;
            } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
                $char_size = 2;
            } else if (!isset($str[$pos+3]) || $str[$pos+3] < "\x80" || "\xBF" < $str[$pos+3]) {
                $char_size = 3;
            } else {
                $valid = true;
                $char_size = 4;
            }
        } else {
            // F5 - FF never appear in UTF-8
            $char_size = 1;
        }

        $char = substr($str, $pos, $char_size);
        $pos += $char_size;

        return true;
    }
Here is the test case.
    function run(array $callables, array $arguments)
    {
        return array_map(function($callable) use($arguments) {
            return array_map($callable, $arguments);
        }, $callables);
    }

    $data = [
        // Table 3-8. Use of U+FFFD in UTF-8 Conversion
        // http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
        "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"
        ."\x80"."\xBF"."\x64",

        // 'FULL MOON SYMBOL' (U+1F315) and invalid byte sequence
        "\xF0\x9F\x8C\x95"."\xF0\x9F\x8C"."\xF0\x9F\x8C"
    ];

    var_dump(run([
        'replace_invalid_byte_sequence',
        'replace_invalid_byte_sequence2',
        'replace_invalid_byte_sequence3',
        'replace_invalid_byte_sequence4',
        'replace_invalid_byte_sequence5',
        'replace_invalid_byte_sequence6'
    ], $data));
Note that mb_convert_encoding has a bug: it can break a valid character that comes just after an invalid byte sequence, or remove an invalid byte sequence that comes after valid characters without adding U+FFFD.
    $data = [
        // U+20AC
        "\xE2\x82\xAC"."\xE2\x82\xAC"."\xE2\x82\xAC",
        "\xE2\x82"    ."\xE2\x82\xAC"."\xE2\x82\xAC",

        // U+24B62
        "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
        "\xF0\xA4\xAD"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
        "\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",

        // 'FULL MOON SYMBOL' (U+1F315)
        "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C",
        "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C" . "\xF0\x9F\x8C"
    ];
Although preg_match() can be used instead of preg_replace_callback(), this function has a limitation on byte size. See bug report #36463 for details. You can confirm it with the following test case.
    str_repeat('a', 10000)
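If PCRE does give up on a long input, the preg_* replace functions return null rather than a string, so it is worth checking for that instead of silently using the result. Below is a minimal sketch of such a check; checked_preg_replace_callback() is a hypothetical wrapper name, and whether a given input actually triggers a failure depends on the PCRE build and its configured limits.

```php
<?php
// Hypothetical wrapper: runs preg_replace_callback() and throws instead
// of handing back null when PCRE reports an error.
function checked_preg_replace_callback($regex, $callback, $str)
{
    $result = preg_replace_callback($regex, $callback, $str);

    if ($result === null) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }

    return $result;
}

// Usage with the long test input from above and a trivial pattern.
$out = checked_preg_replace_callback('/a/', function ($matches) {
    return 'b';
}, str_repeat('a', 10000));

echo strlen($out), PHP_EOL;
```

The same check could wrap the regex-based replacement function, with the byte-comparison version as a fallback when an exception is thrown.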
Finally, my benchmark results (elapsed seconds for 10,000 repetitions) are as follows.
    mb_convert_encoding()      0.19628190994263
    htmlspecialchars()         0.082863092422485
    UConverter::transcode()    0.15999984741211
    UConverter::convert()      0.29843020439148
    preg_replace_callback()    0.63967490196228
    direct comparison          0.71933102607727
Here is the benchmark code.
    function timer(array $callables, array $arguments, $repeat = 10000)
    {
        $ret = [];
        $save = $repeat;

        foreach ($callables as $key => $callable) {
            $start = microtime(true);

            do {
                array_map($callable, $arguments);
            } while($repeat -= 1);

            $stop = microtime(true);
            $ret[$key] = $stop - $start;
            $repeat = $save;
        }

        return $ret;
    }

    $functions = [
        'mb_convert_encoding()' => 'replace_invalid_byte_sequence',
        'htmlspecialchars()' => 'replace_invalid_byte_sequence2',
        'UConverter::transcode()' => 'replace_invalid_byte_sequence3',
        'UConverter::convert()' => 'replace_invalid_byte_sequence4',
        'preg_replace_callback()' => 'replace_invalid_byte_sequence5',
        'direct comparison' => 'replace_invalid_byte_sequence6'
    ];

    foreach (timer($functions, $data) as $description => $time) {
        echo $description, PHP_EOL,
             $time, PHP_EOL;
    }