How do I read UTF-8 characters via a pointer?

Views: 209 Published: 2016/10/14 12:41:32 Tags: c++ unicode utf-8 character-encoding

Question

Suppose I have UTF-8 content stored in memory, how do I read the characters using a pointer? I presume I need to watch for the 8th bit indicating a multi-byte character, but how exactly do I turn the sequence into a valid Unicode character? Also, is wchar_t the proper type to store a single Unicode character?

Here is what I have in mind:

   wchar_t readNextChar (char*& p)
   { 
       wchar_t unicodeChar;
       char ch = *p++;

       if ((ch & 128) != 0)
       {
           // This is a multi-byte character, what do I do now?
           // char chNext = *p++; 
           // ... but how do I assemble the Unicode character?   
           ...
       }
       ...
       return unicodeChar;
   }

Recommended answer

You have to decode the UTF-8 bit pattern into its unencoded UTF-32 representation. If you want the actual Unicode code point, a wchar_t is not guaranteed to be large enough to hold it (it is only 16 bits on Windows), so use an unsigned int/long instead, e.g.:

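// u_char and u_long are BSD-style names, not standard C++, so define them here.
typedef unsigned char u_char;
typedef unsigned long u_long;

#define IS_IN_RANGE(c, f, l)    (((c) >= (f)) && ((c) <= (l)))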

u_long readNextChar (char*& p) 
{  
    // TODO: since UTF-8 is a variable-length
    // encoding, you should pass in the input
    // buffer's actual byte length so that you
    // can determine if a malformed UTF-8
    // sequence would exceed the end of the buffer...

    u_char c1, c2, *ptr = (u_char*) p;
    u_long uc = 0;
    int seqlen;
    // int datalen = ... available length of p ...;    

    /*
    if( datalen < 1 )
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    */

    c1 = ptr[0];

    if( (c1 & 0x80) == 0 )
    {
        uc = (u_long) (c1 & 0x7F);
        seqlen = 1;
    }
    else if( (c1 & 0xE0) == 0xC0 )
    {
        uc = (u_long) (c1 & 0x1F);
        seqlen = 2;
    }
    else if( (c1 & 0xF0) == 0xE0 )
    {
        uc = (u_long) (c1 & 0x0F);
        seqlen = 3;
    }
    else if( (c1 & 0xF8) == 0xF0 )
    {
        uc = (u_long) (c1 & 0x07);
        seqlen = 4;
    }
    else
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }

    /*
    if( seqlen > datalen )
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    */

    for(int i = 1; i < seqlen; ++i)
    {
        c1 = ptr[i];

        if( (c1 & 0xC0) != 0x80 )
        {
            // malformed data, do something !!!
            return (u_long) -1;
        }
    }

    switch( seqlen )
    {
        case 2:
        {
            c1 = ptr[0];

            if( !IS_IN_RANGE(c1, 0xC2, 0xDF) )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }

            break;
        }

        case 3:
        {
            c1 = ptr[0];
            c2 = ptr[1];

            // Only 0xE0 (overlong forms) and 0xED (UTF-16 surrogates) restrict
            // the second byte; every other 3-byte lead accepts 0x80..0xBF.
            if( ((c1 == 0xE0) && !IS_IN_RANGE(c2, 0xA0, 0xBF)) ||
                ((c1 == 0xED) && !IS_IN_RANGE(c2, 0x80, 0x9F)) )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }

            break;
        }

        case 4:
        {
            c1 = ptr[0];
            c2 = ptr[1];

            // 0xF0 must not be overlong, 0xF4 must stay at or below U+10FFFF,
            // and lead bytes above 0xF4 are never valid.
            if( ((c1 == 0xF0) && !IS_IN_RANGE(c2, 0x90, 0xBF)) ||
                ((c1 == 0xF4) && !IS_IN_RANGE(c2, 0x80, 0x8F)) ||
                (c1 > 0xF4) )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }

            break;
        }
    }

    for(int i = 1; i < seqlen; ++i)
    {
        uc = ((uc << 6) | (u_long)(ptr[i] & 0x3F));
    }

    p += seqlen;
    return uc;
}
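
The TODO at the top of the function above suggests passing in the buffer's length so that a truncated sequence cannot read past the end. A minimal sketch of that idea follows. It is not the answerer's code: the names `decodeNext` and `kInvalid` are hypothetical, it assumes C++11 for `char32_t`, and it swaps the byte-range checks for an equivalent check on the decoded value (overlong forms, surrogates, and values above U+10FFFF are rejected after assembly).

```cpp
#include <cstddef>

// Hypothetical sentinel returned for malformed or truncated input.
const char32_t kInvalid = 0xFFFFFFFF;

// Decodes one code point from the range [p, end).
// On success, advances p past the sequence and returns the code point.
// On failure, returns kInvalid and leaves p unchanged.
char32_t decodeNext(const unsigned char*& p, const unsigned char* end)
{
    if (p >= end) return kInvalid;            // nothing left to read

    unsigned char c1 = p[0];
    char32_t uc;
    std::size_t seqlen;
    char32_t minValue;                        // smallest code point allowed for this length

    if (c1 < 0x80)                { uc = c1;        seqlen = 1; minValue = 0; }
    else if ((c1 & 0xE0) == 0xC0) { uc = c1 & 0x1F; seqlen = 2; minValue = 0x80; }
    else if ((c1 & 0xF0) == 0xE0) { uc = c1 & 0x0F; seqlen = 3; minValue = 0x800; }
    else if ((c1 & 0xF8) == 0xF0) { uc = c1 & 0x07; seqlen = 4; minValue = 0x10000; }
    else return kInvalid;                     // stray continuation or invalid lead byte

    // Truncated sequence: the buffer ends before the sequence does.
    if (static_cast<std::size_t>(end - p) < seqlen) return kInvalid;

    for (std::size_t i = 1; i < seqlen; ++i)
    {
        if ((p[i] & 0xC0) != 0x80) return kInvalid;   // not a continuation byte
        uc = (uc << 6) | (p[i] & 0x3F);
    }

    // Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF.
    if (uc < minValue || (uc >= 0xD800 && uc <= 0xDFFF) || uc > 0x10FFFF)
        return kInvalid;

    p += seqlen;
    return uc;
}
```

A caller would loop while `p != end`, stopping or resynchronizing whenever `kInvalid` comes back; leaving `p` unchanged on failure is a design choice that lets the caller decide how many bytes to skip.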

Use a wchar_t only when dealing with UTF-16 code units.
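
To illustrate why a single 16-bit code unit cannot hold every code point, here is a rough sketch of how a decoded code point maps to UTF-16. The helper `appendUtf16` is hypothetical (it does not appear in the answer), it uses C++11 `char16_t`/`std::u16string` rather than `wchar_t` so the unit size is the same on every platform, and it assumes the input is already a valid code point (at most U+10FFFF and not a surrogate).

```cpp
#include <string>

// Hypothetical helper: appends one Unicode code point to a UTF-16 string.
// Assumes cp is valid (<= 0x10FFFF and not in 0xD800..0xDFFF).
void appendUtf16(std::u16string& out, char32_t cp)
{
    if (cp < 0x10000)
    {
        // Basic Multilingual Plane: fits in one code unit.
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        // Supplementary planes: split the remaining 20 bits into two surrogates.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
}
```

Code points from U+10000 upward therefore take two code units, which is why a 16-bit type can carry UTF-16 code units but not arbitrary code points.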
