How to detect UTF-8 in plain C?

Question

I am looking for a code snippet in plain old C that detects that the given string is in UTF-8 encoding. I know the solution with regex, but for various reasons it would be better to avoid using anything but plain C in this particular case.

The regex solution looks like this (warning: various checks omitted):

#include <stdio.h>
#include <pcre.h>

#define UTF8_DETECT_REGEXP  "^([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$"

const char *error;
int         error_off;
int         rc;
int         vect[100];
pcre       *utf8_re;
pcre_extra *utf8_pe;

/* str is the buffer to test and len its length in bytes */
utf8_re = pcre_compile(UTF8_DETECT_REGEXP, PCRE_CASELESS, &error, &error_off, NULL);
utf8_pe = pcre_study(utf8_re, 0, &error);

rc = pcre_exec(utf8_re, utf8_pe, str, len, 0, 0, vect, sizeof(vect)/sizeof(vect[0]));

if (rc > 0) {
    printf("string is in UTF8\n");
} else {
    printf("string is not in UTF8\n");
}

Recommended answer

Here is that expression in plain C:

_Bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}
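
The unusual second-byte ranges in the function above (0xA0 after 0xE0, 0x9F after 0xED, 0x90 after 0xF0, 0x8F after 0xF4) follow from the bit layout of UTF-8. As a rough illustration, the sketch below assembles the code point from the payload bits without any validation. The names decode3 and decode4 are made up for this example and are not part of the answer's code.

```c
#include <stdint.h>

/* Hypothetical helper: combine the payload bits of a 3-byte sequence
 * (1110xxxx 10xxxxxx 10xxxxxx) into a code point. No validation. */
uint32_t decode3(unsigned char b0, unsigned char b1, unsigned char b2)
{
    return ((uint32_t)(b0 & 0x0F) << 12) |
           ((uint32_t)(b1 & 0x3F) << 6)  |
            (uint32_t)(b2 & 0x3F);
}

/* Hypothetical helper: same for a 4-byte sequence
 * (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). No validation. */
uint32_t decode4(unsigned char b0, unsigned char b1,
                 unsigned char b2, unsigned char b3)
{
    return ((uint32_t)(b0 & 0x07) << 18) |
           ((uint32_t)(b1 & 0x3F) << 12) |
           ((uint32_t)(b2 & 0x3F) << 6)  |
            (uint32_t)(b3 & 0x3F);
}
```

With these, E0 9F BF comes out as U+07FF, which already fits in two bytes and is therefore overlong, while E0 A0 80 is U+0800, the first code point that really needs three bytes. ED 9F BF is U+D7FF and ED A0 80 is U+D800, the first surrogate. Likewise F0 90 80 80 is U+10000 and F4 8F BF BF is U+10FFFF, while F4 90 80 80 would be 0x110000, past the end of Unicode.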

Please note that this is a faithful translation of the regular expression recommended by W3C for form validation, which does indeed reject some valid UTF-8 sequences (in particular those containing ASCII control characters).

Also, even after fixing this by making the change mentioned in the comment, it still assumes zero-termination, which prevents embedding NUL characters, although it should technically be legal.
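
One possible way to lift both limitations is to pass the length explicitly instead of relying on a terminator, and to accept every byte up to 0x7F as ASCII. The following is only a sketch under those assumptions. The names is_utf8_n and in_range are invented here, and it has not been checked against the header mentioned below.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if b lies in the inclusive range [lo, hi]. */
static bool in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return lo <= b && b <= hi;
}

/* Hypothetical length-aware variant: validates exactly len bytes.
 * Accepts all ASCII bytes 0x00-0x7F, including embedded NULs. */
bool is_utf8_n(const void *data, size_t len)
{
    const unsigned char *p = data;
    size_t i = 0;

    if (!p)
        return false;

    while (i < len) {
        unsigned char b0 = p[i];
        size_t need;                         /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF;  /* range of the first one */

        if (b0 <= 0x7F) { i++; continue; }
        else if (in_range(b0, 0xC2, 0xDF)) { need = 1; }
        else if (b0 == 0xE0)               { need = 2; lo = 0xA0; } /* no overlongs */
        else if (b0 == 0xED)               { need = 2; hi = 0x9F; } /* no surrogates */
        else if (in_range(b0, 0xE1, 0xEF)) { need = 2; }
        else if (b0 == 0xF0)               { need = 3; lo = 0x90; } /* no overlongs */
        else if (in_range(b0, 0xF1, 0xF3)) { need = 3; }
        else if (b0 == 0xF4)               { need = 3; hi = 0x8F; } /* <= U+10FFFF */
        else return false;                   /* 0x80-0xC1 and 0xF5-0xFF */

        if (len - i <= need)
            return false;                    /* sequence is truncated */
        if (!in_range(p[i + 1], lo, hi))
            return false;
        for (size_t k = 2; k <= need; k++)
            if (!in_range(p[i + k], 0x80, 0xBF))
                return false;

        i += need + 1;
    }
    return true;
}
```

Called as is_utf8_n("a\0b", 3), this should accept the embedded NUL, while a lone lead byte such as is_utf8_n("\xC3", 1) is rejected as truncated.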

When I dabbled in creating my own string library, I went with modified UTF-8 (i.e. encoding NUL as an overlong two-byte sequence). Feel free to use this header as a template for providing a validation routine which doesn't suffer from the above shortcomings.
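
To make the modified UTF-8 idea concrete: a NUL byte is written as the two bytes 0xC0 0x80, so the result never contains 0x00 and can still be zero-terminated. Below is a minimal sketch of that conversion. The function name mutf8_from_bytes and its size-returning convention are assumptions made for this example, not taken from the header.

```c
#include <stddef.h>

/* Hypothetical helper: copy len bytes from src to dst, writing every
 * 0x00 byte as the overlong pair 0xC0 0x80, then append a terminator.
 * Returns the number of bytes required including the terminator.
 * Nothing is written if dst is NULL or cap is too small. */
size_t mutf8_from_bytes(const unsigned char *src, size_t len,
                        unsigned char *dst, size_t cap)
{
    size_t needed = 1;  /* room for the terminating 0x00 */

    for (size_t i = 0; i < len; i++)
        needed += (src[i] == 0x00) ? 2 : 1;

    if (dst == NULL || cap < needed)
        return needed;

    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == 0x00) {
            dst[o++] = 0xC0;
            dst[o++] = 0x80;
        } else {
            dst[o++] = src[i];
        }
    }
    dst[o] = 0x00;
    return needed;
}
```

A caller would compare the return value with the capacity it passed in to see whether the output was written. A validator for such strings would additionally have to accept 0xC0 0x80 as a special case, since strict UTF-8 rejects it as overlong.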
