How to detect UTF-8 in plain C?
Question
I am looking for a code snippet in plain old C that detects whether a given string is UTF-8 encoded. I know the solution with a regex, but for various reasons it would be better to avoid using anything but plain C in this particular case.
The solution with a regex looks like this (warning: various checks omitted):
#define UTF8_DETECT_REGEXP "^([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$"

const char *error;
int error_off;
int rc;
int vect[100];
pcre *utf8_re;        /* compiled pattern */
pcre_extra *utf8_pe;  /* study data, may be NULL */

utf8_re = pcre_compile(UTF8_DETECT_REGEXP, PCRE_CASELESS, &error, &error_off, NULL);
utf8_pe = pcre_study(utf8_re, 0, &error);

/* str and len are the input buffer and its length in bytes */
rc = pcre_exec(utf8_re, utf8_pe, str, len, 0, 0, vect, sizeof(vect)/sizeof(vect[0]));

if (rc > 0) {
    printf("string is in UTF8\n");
} else {
    printf("string is not in UTF8\n");
}
Recommended answer
Here is this expression implemented in plain C:
_Bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}
Please note that this is a faithful translation of the regular expression recommended by W3C for form validation, which does indeed reject some valid UTF-8 sequences (in particular those containing ASCII control characters other than tab, line feed and carriage return).
Also, even after fixing this by making the change mentioned in the code comment (testing bytes[0] <= 0x7F), the function still assumes zero-termination, which prevents embedded NUL characters, although they should technically be legal.
When I dabbled in creating my own string library, I went with modified UTF-8 (i.e. encoding NUL as an overlong two-byte sequence). Feel free to use this header as a template for providing a validation routine which doesn't suffer from the above shortcomings.
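To illustrate what a routine without those two shortcomings might look like, here is a minimal sketch. It is not the header mentioned above; the names utf8_valid_n and is_cont are hypothetical. It assumes the caller passes an explicit byte length, accepts every ASCII byte (control characters and NUL included), and validates standard UTF-8 only, so the overlong NUL of modified UTF-8 (0xC0 0x80) is rejected.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if b is a continuation byte (bit pattern 10xxxxxx). */
static bool is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: validates exactly len bytes of data as UTF-8.
 * Because the length is explicit, embedded NUL bytes are allowed. */
bool utf8_valid_n(const void *data, size_t len)
{
    const unsigned char *p = data;
    size_t i = 0;

    while (i < len) {
        unsigned char c = p[i];
        size_t need;              /* number of continuation bytes expected */
        unsigned char lo = 0x80;  /* allowed range of the first continuation byte */
        unsigned char hi = 0xBF;

        if (c <= 0x7F) {          /* any ASCII byte, control characters included */
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {
            need = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            need = 2;
            if (c == 0xE0) lo = 0xA0;  /* exclude overlong 3-byte forms */
            if (c == 0xED) hi = 0x9F;  /* exclude surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3;
            if (c == 0xF0) lo = 0x90;  /* exclude overlong 4-byte forms */
            if (c == 0xF4) hi = 0x8F;  /* stay at or below U+10FFFF */
        } else {
            return false;         /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */
        }

        if (len - i <= need)      /* sequence truncated by end of buffer */
            return false;
        if (p[i + 1] < lo || p[i + 1] > hi)
            return false;
        for (size_t k = 2; k <= need; k++)
            if (!is_cont(p[i + k]))
                return false;

        i += need + 1;
    }
    return true;
}
```

With this sketch, utf8_valid_n("a\0b", 3) is accepted because the NUL is just another ASCII byte, while utf8_valid_n("\xED\xA0\x80", 3) (a surrogate) and utf8_valid_n("\xE2\x82", 2) (a truncated sequence) are rejected. Accepting modified UTF-8 would additionally require a special case for the pair 0xC0 0x80.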