正则表达式检测无效的 UTF-8 字符串 [英] Regex to detect invalid UTF-8 string

查看:32
本文介绍了正则表达式检测无效的 UTF-8 字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在 PHP 中,我们可以使用 mb_check_encoding() 来判断是否字符串是有效的 UTF-8.但这不是一个可移植的解决方案,因为它需要编译并启用 mbstring 扩展.此外,它不会告诉我们哪个字符无效.

In PHP, we can use mb_check_encoding() to determine if a string is valid UTF-8. But that's not a portable solution as it requires the mbstring extension to be compiled in and enabled. Additionally, it won't tell us which character is invalid.

是否有可以匹配给定字符串中无效 UTF-8 字节的正则表达式(或其他 100% 可移植的方法)?

Is there a regular expression (or another other 100% portable method) that can match invalid UTF-8 bytes in a given string?

这样,如果需要,可以替换这些字节(保留二进制信息,例如在构建包含二进制数据的测试输出 XML 文件时).因此,将字符转换为 UTF-8 会丢失信息.所以,我们可能想要转换:

That way, those bytes can be replaced if needed (keeping the binary information, such as when building a test output XML file that includes binary data). So converting the characters to UTF-8 would lose information. So, we may want to convert:

"foo" . chr(128) . chr(255)

进入

"foo<128><255>"

所以只是检测"字符串不够好,我们需要能够检测哪些字符是无效的.

So just "detecting" that the string is not good enough, we'd need to be able to detect which characters are invalid.

推荐答案

您可以使用此 PCRE 正则表达式来检查字符串中的有效 UTF-8.如果正则表达式匹配,则字符串包含无效的字节序列.它是 100% 可移植的,因为它不依赖于 PCRE_UTF8 来编译.

You can use this PCRE regular expression to check for a valid UTF-8 in a string. If the regex matches, the string contains invalid byte sequences. It's 100% portable because it doesn't rely on PCRE_UTF8 to be compiled in.

$regex = '/(
    [xC0-xC1] # Invalid UTF-8 Bytes
    | [xF5-xFF] # Invalid UTF-8 Bytes
    | xE0[x80-x9F] # Overlong encoding of prior code point
    | xF0[x80-x8F] # Overlong encoding of prior code point
    | [xC2-xDF](?![x80-xBF]) # Invalid UTF-8 Sequence Start
    | [xE0-xEF](?![x80-xBF]{2}) # Invalid UTF-8 Sequence Start
    | [xF0-xF4](?![x80-xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[x00-x7FxF5-xFF])[x80-xBF] # Invalid UTF-8 Sequence Middle
    | (?<![xC2-xDF]|[xE0-xEF]|[xE0-xEF][x80-xBF]|[xF0-xF4]|[xF0-xF4][x80-xBF]|[xF0-xF4][x80-xBF]{2})[x80-xBF] # Overlong Sequence
    | (?<=[xE0-xEF])[x80-xBF](?![x80-xBF]) # Short 3 byte sequence
    | (?<=[xF0-xF4])[x80-xBF](?![x80-xBF]{2}) # Short 4 byte sequence
    | (?<=[xF0-xF4][x80-xBF])[x80-xBF](?![x80-xBF]) # Short 4 byte sequence (2)
)/x';

我们可以通过创建一些文本变体来测试它:

We can test it by creating a few variations of text:

// Overlong encoding of code point 0
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Overlong encoding of 5 byte encoding
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Overlong encoding of 6 byte encoding
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);        
var_dump(preg_match($regex, $text)); // int(1)
// High code-point without trailing characters
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)

等等...

事实上,由于这匹配无效字节,您可以在 preg_replace 中使用它来替换它们:

In fact, since this matches invalid bytes, you could then use it in preg_replace to replace them away:

preg_replace($regex, '', $text); // Remove all invalid UTF-8 code-points

这篇关于正则表达式检测无效的 UTF-8 字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆