How to easily detect UTF-8 encoding in a string?

Views: 123 | Posted: 2021/5/4 19:15:56 | Tags: c++ windows string encoding utf-8


Problem description

I have a string that is filled with data from another program, and this data may or may not be UTF-8 encoded. If it is not, I can encode it to UTF-8, but what is the best way to detect UTF-8 in C++? I saw this variant https://stackoverflow.com/questions/... but there are comments saying that this solution does not give 100% detection. If I encode to UTF-8 a string that already contains UTF-8 data, I will write the wrong text to the database.
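To make the double-encoding risk concrete, here is a minimal sketch. It assumes the legacy encoding is ISO-8859-1 (Latin-1), chosen only because its mapping to Unicode is trivial; the question itself uses the Windows ANSI code page. The helper name `latin1_to_utf8` is hypothetical and not part of the original post.

```cpp
#include <string>

// Hypothetical helper (an assumption for illustration): treats every input
// byte as an ISO-8859-1 code point and encodes that code point as UTF-8.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);  // each input byte yields at most 2 bytes
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII is identical in Latin-1 and UTF-8.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF become a two-byte sequence: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Under that assumption, the Latin-1 byte `E9` ("é") becomes the UTF-8 bytes `C3 A9`. Passing `C3 A9` through the same conversion again yields `C3 83 C2 A9`, which displays as "Ã©". That is the kind of wrong text that would end up in the database if already-encoded data were converted a second time.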

So can I just use this UTF-8 detection:

bool is_utf8(const char * string)
{
    if(!string)
        return false;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return false;  // no valid UTF-8 sequence starts at this byte
    }

    return true;
}
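The function above depends on the terminating NUL to stop reads past the end of the buffer, so it cannot check data that contains embedded NUL bytes. Below is a sketch of a length-aware variant that works on a `std::string`. The name `is_valid_utf8` is hypothetical, and unlike the function above it accepts every ASCII byte (0x00 to 0x7F), including control characters. The byte ranges are the same ones used above.

```cpp
#include <cstddef>
#include <string>

// Sketch only: returns true when every byte of `s` belongs to a well-formed
// UTF-8 sequence (no overlong forms, no surrogates, nothing above U+10FFFF).
bool is_valid_utf8(const std::string& s)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = p[i];
        if (b0 <= 0x7F) { i += 1; continue; }  // ASCII, one byte

        std::size_t len = 0;      // total length of this sequence
        unsigned char lo = 0x80;  // allowed range for the second byte
        unsigned char hi = 0xBF;
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; }  // no overlongs
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; }  // no surrogates
        else if (b0 >= 0xE1 && b0 <= 0xEF) { len = 3; }
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; }  // no overlongs
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; }  // <= U+10FFFF
        else return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence

        if (n - i < len) return false;                  // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return false;
        for (std::size_t k = 2; k < len; ++k)           // remaining continuation bytes
            if (p[i + k] < 0x80 || p[i + k] > 0xBF) return false;
        i += len;
    }
    return true;
}
```

A passing check only shows that the bytes are well-formed UTF-8. Text in some other encoding can occasionally form valid sequences by coincidence, which is probably what the comments about less than 100% detection refer to.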

And this code for encoding to UTF-8 if the detection fails:

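     // Requires <windows.h> and <string>.
     // EscReason is the string received from the other program.
     std::string text = EscReason;
     if(!is_utf8(text.c_str()))
     {
        // Step 1: ANSI code page -> UTF-16. The first call only returns the required length.
        int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
            (int)text.length(), 0, 0);
        std::wstring utf16_str(size, L'\0');

        MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
            (int)text.length(), &utf16_str[0], size);

        // Step 2: UTF-16 -> UTF-8, again asking for the buffer size first.
        int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
            (int)utf16_str.length(), 0, 0, 0, 0);

        std::string utf8_str(utf8_size, '\0');
        WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
            (int)utf16_str.length(), &utf8_str[0], utf8_size, 0, 0);

        text = utf8_str;
     }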

Or is the code above not done properly? I am doing this on Windows 7. What about Ubuntu? Does this variant work there?
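On the Ubuntu part of the question: `is_utf8` is plain C++ and should compile anywhere, but `MultiByteToWideChar` and `WideCharToMultiByte` are Win32 functions and are not available on Linux. One possible substitute is POSIX `iconv`, sketched below. This is an illustration under assumptions, not a drop-in replacement: the helper name `to_utf8` is hypothetical, the caller has to name the source charset explicitly because Linux has no equivalent of `CP_ACP`, and the `char**` input parameter matches the glibc declaration of `iconv`.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `in` from `from_charset` (for example
// "ISO-8859-1") to UTF-8 using POSIX iconv. Throws on any failure.
std::string to_utf8(const std::string& in, const char* from_charset)
{
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported charset?");

    // Assumption: 4 output bytes per input byte is enough for this sketch.
    std::string out(in.size() * 4 + 4, '\0');
    char* src = const_cast<char*>(in.data());  // glibc takes char**, not const char**
    std::size_t src_left = in.size();
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    // iconv advances src/dst and decrements the remaining byte counts.
    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed: invalid input or buffer too small");

    out.resize(out.size() - dst_left);  // keep only the bytes actually written
    return out;
}
```

For example, `to_utf8(text, "ISO-8859-1")` would be called only when the detection function reports that `text` is not already UTF-8.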
