How do I read UTF-8 characters via a pointer?
Question
Suppose I have UTF-8 content stored in memory. How do I read the characters using a pointer? I presume I need to watch for the 8th bit indicating a multi-byte character, but how exactly do I turn the sequence into a valid Unicode character? Also, is wchar_t the proper type to store a single Unicode character?
This is what I have in mind:
wchar_t readNextChar (char*& p)
{
    wchar_t unicodeChar;
    char ch = *p++;
    if ((ch & 128) != 0)
    {
        // This is a multi-byte character, what do I do now?
        // char chNext = *p++;
        // ... but how do I assemble the Unicode character?
        ...
    }
    ...
    return unicodeChar;
}
Recommended answer
You have to decode the UTF-8 bit pattern to its unencoded UTF-32 representation. If you want the actual Unicode code point, wchar_t is not guaranteed to be large enough to hold it (it is only 16 bits wide on Windows, for example), so you have to use an unsigned int/long instead, i.e.:
#define IS_IN_RANGE(c, f, l) (((c) >= (f)) && ((c) <= (l)))

// u_char and u_long are BSD-style typedefs (available, for example, from
// <sys/types.h> on many Unix-like systems); substitute unsigned char and
// unsigned long where they are not defined.
u_long readNextChar (char*& p)
{
    // TODO: since UTF-8 is a variable-length
    // encoding, you should pass in the input
    // buffer's actual byte length so that you
    // can determine if a malformed UTF-8
    // sequence would exceed the end of the buffer...
    u_char c1, c2, *ptr = (u_char*) p;
    u_long uc = 0;
    int seqlen;
    // int datalen = ... available length of p ...;
    /*
    if( datalen < 1 )
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    */

    // The lead byte determines the sequence length and
    // contributes the most significant payload bits.
    c1 = ptr[0];
    if( (c1 & 0x80) == 0 )
    {
        uc = (u_long) (c1 & 0x7F);
        seqlen = 1;
    }
    else if( (c1 & 0xE0) == 0xC0 )
    {
        uc = (u_long) (c1 & 0x1F);
        seqlen = 2;
    }
    else if( (c1 & 0xF0) == 0xE0 )
    {
        uc = (u_long) (c1 & 0x0F);
        seqlen = 3;
    }
    else if( (c1 & 0xF8) == 0xF0 )
    {
        uc = (u_long) (c1 & 0x07);
        seqlen = 4;
    }
    else
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    /*
    if( seqlen > datalen )
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    */

    // Every trailing byte must have the form 10xxxxxx.
    for(int i = 1; i < seqlen; ++i)
    {
        c1 = ptr[i];
        if( (c1 & 0xC0) != 0x80 )
        {
            // malformed data, do something !!!
            return (u_long) -1;
        }
    }

    // Reject overlong forms, surrogates, and values above U+10FFFF.
    switch( seqlen )
    {
        case 2:
        {
            c1 = ptr[0];
            if( !IS_IN_RANGE(c1, 0xC2, 0xDF) )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }
            break;
        }
        case 3:
        {
            // c1 is already known to be in 0xE0..0xEF here, so only the
            // second-byte ranges for 0xE0 (overlong forms) and 0xED
            // (surrogates) need checking.
            c1 = ptr[0];
            c2 = ptr[1];
            if( ((c1 == 0xE0) && !IS_IN_RANGE(c2, 0xA0, 0xBF)) ||
                ((c1 == 0xED) && !IS_IN_RANGE(c2, 0x80, 0x9F)) )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }
            break;
        }
        case 4:
        {
            // Lead bytes 0xF5..0xF7 would encode values above U+10FFFF.
            c1 = ptr[0];
            c2 = ptr[1];
            if( ((c1 == 0xF0) && !IS_IN_RANGE(c2, 0x90, 0xBF)) ||
                ((c1 == 0xF4) && !IS_IN_RANGE(c2, 0x80, 0x8F)) ||
                !IS_IN_RANGE(c1, 0xF0, 0xF4) )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }
            break;
        }
    }

    // Append the low 6 bits of each trailing byte.
    for(int i = 1; i < seqlen; ++i)
    {
        uc = ((uc << 6) | (u_long)(ptr[i] & 0x3F));
    }
    p += seqlen;
    return uc;
}
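The TODO in the code above asks for the buffer length to be passed in. One way that could look is sketched below. This is not the answer's own code: decodeNext and kInvalid are hypothetical names, the input is assumed to be given as a [p, end) pointer pair, and validation is done by range-checking the decoded value rather than by the lead-byte tables used above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sentinel for malformed or truncated input.
const std::uint32_t kInvalid = 0xFFFFFFFFu;

// Hypothetical helper: decodes one code point from [p, end) and advances p
// past it. On any error it returns kInvalid and leaves p unchanged.
std::uint32_t decodeNext(const unsigned char*& p, const unsigned char* end)
{
    if (p >= end) return kInvalid;            // nothing left to read

    const unsigned char c1 = p[0];
    std::uint32_t uc;                         // code point being assembled
    std::size_t seqlen;                       // total bytes in this sequence
    std::uint32_t minValue;                   // smallest value legal for seqlen

    if      (c1 < 0x80)           { uc = c1;        seqlen = 1; minValue = 0x0;     }
    else if ((c1 & 0xE0) == 0xC0) { uc = c1 & 0x1F; seqlen = 2; minValue = 0x80;    }
    else if ((c1 & 0xF0) == 0xE0) { uc = c1 & 0x0F; seqlen = 3; minValue = 0x800;   }
    else if ((c1 & 0xF8) == 0xF0) { uc = c1 & 0x07; seqlen = 4; minValue = 0x10000; }
    else return kInvalid;                     // stray trailing byte or 0xF8..0xFF

    // The bounds check the TODO describes: the sequence must fit in the buffer.
    if (static_cast<std::size_t>(end - p) < seqlen) return kInvalid;

    for (std::size_t i = 1; i < seqlen; ++i) {
        if ((p[i] & 0xC0) != 0x80) return kInvalid;   // not a 10xxxxxx byte
        uc = (uc << 6) | (p[i] & 0x3Fu);
    }

    // Reject overlong encodings, UTF-16 surrogates, and values above U+10FFFF.
    if (uc < minValue || uc > 0x10FFFF || (uc >= 0xD800 && uc <= 0xDFFF))
        return kInvalid;

    p += seqlen;
    return uc;
}
```

A caller would typically loop while p != end, calling decodeNext(p, end) and stopping (or skipping a byte) when it returns kInvalid.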
Use wchar_t only when you are dealing with UTF-16 code units instead.
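If UTF-16 code units are what you need, a decoded code point can be split into one or two of them. The following is a minimal sketch of that step; toUtf16 is a hypothetical helper name, and it writes to std::uint16_t rather than wchar_t so that it does not depend on the platform's wchar_t width.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: encodes code point cp as UTF-16 into out and returns
// the number of code units written (1 or 2), or 0 if cp is a surrogate or
// lies above U+10FFFF.
std::size_t toUtf16(std::uint32_t cp, std::uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit holds the value.
        out[0] = static_cast<std::uint16_t>(cp);
        return 1;
    }

    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    cp -= 0x10000;
    out[0] = static_cast<std::uint16_t>(0xD800 + (cp >> 10));    // high surrogate
    out[1] = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    return 2;
}
```

On a platform where wchar_t is 16 bits wide, each resulting unit can be stored in a wchar_t directly.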