How to easily detect UTF-8 encoding in a string?


Question

I have a string that is filled with data from another program, and this data may or may not be UTF-8 encoded. If it is not, I can encode it to UTF-8, but what is the best way to detect UTF-8 in C++? I saw this variant, https://stackoverflow.com/questions/..., but there are comments saying that this solution does not give 100% detection. If I encode a string that already contains UTF-8 data to UTF-8 again, I will write wrong text to the database.

So can I just use this UTF-8 detection:

bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}

And this code to encode to UTF-8 when the detection returns false:

     // Assumption: EscReason holds the incoming text that is checked and converted.
     std::string text = EscReason;
     if(!is_utf8(text.c_str()))
     {
        int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
            text.length(), 0, 0);
        std::wstring utf16_str(size, '\0');

        MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
            text.length(), &utf16_str[0], size);
    
        int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
            utf16_str.length(), 0, 0, 0, 0);

        std::string utf8_str(utf8_size, '\0');
        WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
            utf16_str.length(), &utf8_str[0], utf8_size, 0, 0);

        text = utf8_str;
     }

Or is the code above not done properly? I am doing this on Windows 7. What about Ubuntu? Does this variant work there?

Recommended answer

You probably don't understand UTF-8 and the alternatives. A byte has only 256 possible values, which is not a lot given the number of characters. As a result, many byte sequences are both valid UTF-8 strings and valid strings in other encodings.

In fact, every ASCII string is intentionally a valid UTF-8 string with essentially the same meaning. Your code would return true for is_utf8("Hello").

Many other non-UTF-8, non-ASCII strings also share byte sequences with valid UTF-8 strings. And there is just no way to convert a non-UTF-8 string to UTF-8 without knowing exactly which non-UTF-8 encoding it is. Even Latin-1 and Latin-2 are already quite different. CP_ACP is even worse than Latin-1: CP_ACP isn't even the same everywhere.
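To make the ambiguity concrete, here is a minimal sketch. The helper name latin1_to_utf8 is hypothetical and not part of the question's code. It relies only on the fact that each Latin-1 byte value equals its Unicode code point. The two bytes 0xC3 0xA9 are a valid UTF-8 encoding of "é", but they are also a valid Latin-1 string reading "Ã©", so no validator can tell which one was meant.

```cpp
#include <string>

// Hypothetical helper for illustration: treats every input byte as an
// ISO-8859-1 (Latin-1) character and encodes it as UTF-8. Latin-1 byte
// values are identical to their Unicode code points (U+0000..U+00FF).
std::string latin1_to_utf8(const std::string& latin1)
{
    std::string utf8;
    for (unsigned char byte : latin1) {
        if (byte < 0x80) {
            // ASCII range: one byte, unchanged.
            utf8 += static_cast<char>(byte);
        } else {
            // U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx.
            utf8 += static_cast<char>(0xC0 | (byte >> 6));
            utf8 += static_cast<char>(0x80 | (byte & 0x3F));
        }
    }
    return utf8;
}
```

Calling latin1_to_utf8("\xE9") gives the bytes C3 A9, the correct UTF-8 for a Latin-1 "é". Calling it again on that result gives C3 83 C2 A9, which displays as "Ã©". This is the double-encoding damage the question is worried about, and both inputs were perfectly valid Latin-1.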

Your text must go into the database as UTF-8. Thus, if it isn't yet UTF-8, it must be converted, and you must know the exact source encoding. There is no magical escape.

On Linux, iconv is the usual way to convert between two encodings.
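As a rough sketch of what that could look like, the function below wraps the POSIX iconv_open / iconv / iconv_close calls. The name to_utf8 and the 256-byte chunk size are assumptions made for this example, and the caller still has to supply the real source encoding, since it cannot be detected. Encoding names such as "ISO-8859-1" are the ones glibc accepts; other iconv implementations may differ.

```cpp
#include <iconv.h>

#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from a KNOWN source encoding to UTF-8.
std::string to_utf8(const std::string& input, const char* from_encoding)
{
    // iconv_open takes the target encoding first, then the source encoding.
    iconv_t cd = iconv_open("UTF-8", from_encoding);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open: unsupported conversion");

    std::string output;
    // iconv takes a non-const char** even though it does not modify the input.
    char* in_ptr = const_cast<char*>(input.data());
    size_t in_left = input.size();
    char buffer[256];

    while (in_left > 0) {
        char* out_ptr = buffer;
        size_t out_left = sizeof buffer;
        const size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        const int error = errno;  // save before any other call can change it
        output.append(buffer, sizeof buffer - out_left);
        // E2BIG only means the chunk buffer is full; loop again for the rest.
        if (rc == static_cast<size_t>(-1) && error != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv: invalid or incomplete input");
        }
    }

    iconv_close(cd);
    return output;
}
```

The same input byte gives different results depending on the declared encoding: to_utf8("\xE8", "ISO-8859-1") yields "è" (bytes C3 A8), while to_utf8("\xE8", "ISO-8859-2") yields "č" (bytes C4 8D). That is why the source encoding has to be known rather than guessed.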
