Regex to detect Invalid UTF-8 String

Question

In PHP, we can use mb_check_encoding() to determine if a string is valid UTF-8. But that's not a portable solution as it requires the mbstring extension to be compiled in and enabled. Additionally, it won't tell us which character is invalid.

Is there a regular expression (or any other 100% portable method) that can match invalid UTF-8 bytes in a given string? That way, those bytes can be replaced if needed while keeping the binary information (for example, when building a test output XML file that includes binary data). Simply converting the characters to UTF-8 would lose information. So, we may want to convert:

"foo" . chr(128) . chr(255)

into

"foo<128><255>"

So just "detecting" that the string is invalid is not good enough; we need to be able to detect which characters are invalid.

Recommended answer

You can use this PCRE regular expression to check a string for invalid UTF-8. If the regex matches, the string contains invalid byte sequences. It's 100% portable because it doesn't rely on PCRE_UTF8 being compiled in.

$regex = '/(
    [\xC0-\xC1] # Invalid UTF-8 Bytes
    | [\xF5-\xFF] # Invalid UTF-8 Bytes
    | \xE0[\x80-\x9F] # Overlong encoding of prior code point
    | \xF0[\x80-\x8F] # Overlong encoding of prior code point
    | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
    | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
    | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence (continuation byte with no lead byte in range)
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';

We can test it by creating a few variations of text:

// Overlong encoding of code point 0
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// 5-byte sequence (lead byte 0xF8 is not allowed in UTF-8)
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// 6-byte sequence (lead byte 0xFC is not allowed in UTF-8)
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Lead byte that is not followed by a continuation byte
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)

etc...
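Since the question asks which bytes are invalid, not just whether any are, here is a minimal sketch that reports the byte offset of every match using preg_match_all with PREG_OFFSET_CAPTURE. The helper name find_invalid_utf8 is hypothetical (it is not part of the original answer), and the pattern is repeated only so the snippet runs on its own.

```php
<?php
// Same pattern as in the answer, repeated so this snippet is self-contained.
$regex = '/(
    [\xC0-\xC1]
    | [\xF5-\xFF]
    | \xE0[\x80-\x9F]
    | \xF0[\x80-\x8F]
    | [\xC2-\xDF](?![\x80-\xBF])
    | [\xE0-\xEF](?![\x80-\xBF]{2})
    | [\xF0-\xF4](?![\x80-\xBF]{3})
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF]
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF]
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF])
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2})
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF])
)/x';

// Hypothetical helper: list every match as its byte offset plus the matched
// bytes in hex. A single match may cover more than one byte.
function find_invalid_utf8(string $text, string $regex): array
{
    $found = [];
    preg_match_all($regex, $text, $matches, PREG_OFFSET_CAPTURE);
    // $matches[0] holds [matched bytes, byte offset] pairs for the whole pattern.
    foreach ($matches[0] as [$bytes, $offset]) {
        $found[] = ['offset' => $offset, 'hex' => bin2hex($bytes)];
    }
    return $found;
}

foreach (find_invalid_utf8("foo" . chr(128) . chr(255), $regex) as $hit) {
    printf("offset %d: 0x%s\n", $hit['offset'], $hit['hex']);
}
// offset 3: 0x80
// offset 4: 0xff
```

One caveat worth checking against your own requirements: as written, the pattern does not appear to flag UTF-16 surrogates encoded as \xED[\xA0-\xBF] followed by a continuation byte, which stricter validators reject.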

In fact, since this matches the invalid bytes themselves, you can use it in preg_replace to strip them:

$text = preg_replace($regex, '', $text); // Remove all invalid UTF-8 bytes
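To get the output the question asks for ("foo<128><255>") instead of dropping the bytes, one option is preg_replace_callback. The following is a sketch under that assumption; the helper name mark_invalid_utf8 is hypothetical, and the pattern is repeated only so the snippet runs on its own.

```php
<?php
// Same pattern as in the answer, repeated so this snippet is self-contained.
$regex = '/(
    [\xC0-\xC1]
    | [\xF5-\xFF]
    | \xE0[\x80-\x9F]
    | \xF0[\x80-\x8F]
    | [\xC2-\xDF](?![\x80-\xBF])
    | [\xE0-\xEF](?![\x80-\xBF]{2})
    | [\xF0-\xF4](?![\x80-\xBF]{3})
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF]
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF]
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF])
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2})
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF])
)/x';

// Hypothetical helper: replace every matched byte with a "<decimal>" marker
// so the original byte values stay recoverable from the output.
function mark_invalid_utf8(string $text, string $regex): string
{
    return preg_replace_callback(
        $regex,
        function (array $m): string {
            $out = '';
            // A match may span two bytes (e.g. \xE0 plus one byte), so mark each one.
            foreach (str_split($m[0]) as $byte) {
                $out .= '<' . ord($byte) . '>';
            }
            return $out;
        },
        $text
    );
}

echo mark_invalid_utf8("foo" . chr(128) . chr(255), $regex), "\n"; // foo<128><255>
```

If the result goes into XML, as in the question, the markers themselves would still need escaping (for example with htmlspecialchars) or a different delimiter.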
