How do I "decode" a UTF-8 character?
Question
Let's assume I want to write a function to compare two Unicode characters. How should I do that? I have read some articles (like this) but still don't get it. Let's take € as input. It is in the range 0x0800 to 0xFFFF, so it takes 3 bytes to encode. How do I decode it? Do I use bitwise operations to get the 3 bytes from a wchar_t and store them in 3 chars? Example code in C would be great.
Here's my C code to "decode" it, but it obviously prints the wrong values:
#include <stdio.h>
#include <wchar.h>
// support for UTF8 which encodes up to 4 bytes only
// (defined before the prototypes so that print() refers to this type)
struct Bytes
{
char v1;
char v2;
char v3;
char v4;
};
void printbin(unsigned n);
int length(wchar_t c);
void print(struct Bytes *b);
int main(void)
{
struct Bytes bytes = { 0 };
wchar_t c = '€';
int len = length(c);
//c = 11100010 10000010 10101100
bytes.v1 = (c >> 24) << 4; // get first byte and remove leading "1110"
bytes.v2 = (c >> 16) << 5; // skip over first byte and get 000010 from 10000010
bytes.v3 = (c >> 8) << 5; // skip over first two bytes and 10101100 from 10000010
print(&bytes);
return 0;
}
void print(struct Bytes *b)
{
int v1 = (int) (b->v1);
int v2 = (int)(b->v2);
int v3 = (int)(b->v3);
int v4 = (int)(b->v4);
printf("v1 = %d\n", v1);
printf("v2 = %d\n", v2);
printf("v3 = %d\n", v3);
printf("v4 = %d\n", v4);
}
int length(wchar_t c)
{
if (c >= 0 && c <= 0x007F)
return 1;
if (c >= 0x0080 && c <= 0x07FF)
return 2;
if (c >= 0x0800 && c <= 0xFFFF)
return 3;
if (c >= 0x10000 && c <= 0x1FFFFF)
return 4;
if (c >= 0x200000 && c <= 0x3FFFFFF)
return 5;
if (c >= 0x4000000 && c <= 0x7FFFFFFF)
return 6;
return -1;
}
void printbin(unsigned n)
{
if (!n)
return;
printbin(n >> 1);
printf("%c", (n & 1) ? '1' : '0');
}
Recommended answer
It's not at all easy to compare UTF-8 encoded characters. Best not to try. Either:
- Convert them both to a wide format (32-bit integer) and compare the values arithmetically. See wstring_convert or your favorite vendor-specific function; or
- Convert them into 1-character strings and use a function that compares UTF-8 encoded strings. There is no standard way to do this in C++, but it is the preferred method in other languages such as Ruby and PHP.
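The first option can be sketched by hand in C, without wstring_convert. The helper names utf8_decode and utf8_char_cmp below are hypothetical, and the decoder is deliberately minimal: it assumes it only needs to handle sequences of up to 4 bytes and does not reject overlong encodings or surrogate code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s into *cp.
 * Returns the number of bytes consumed (1 to 4), or 0 on malformed input.
 * Hypothetical helper, not a full validator. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;

    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
        *cp = s[0] & 0x1F;
        len = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
        *cp = s[0] & 0x0F;
        len = 3;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
        *cp = s[0] & 0x07;
        len = 4;
    } else {
        return 0;                         /* stray continuation or invalid lead byte */
    }

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)        /* continuation bytes are 10xxxxxx */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F); /* append the 6 payload bits */
    }
    return len;
}

/* Compare two UTF-8 encoded characters by code point.
 * Returns -1, 0 or 1. Assumes both inputs are well formed;
 * a malformed input is treated as code point 0 in this sketch. */
int utf8_char_cmp(const unsigned char *a, const unsigned char *b)
{
    uint32_t ca = 0, cb = 0;

    if (utf8_decode(a, &ca) == 0)
        ca = 0;
    if (utf8_decode(b, &cb) == 0)
        cb = 0;
    return (ca > cb) - (ca < cb);
}
```

For the € from the question, the bytes E2 82 AC decode as 0x2, 0x02 and 0x2C, which combine to the code point 0x20AC.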
Just to make it clear, the hard part is taking raw bits/bytes/characters encoded as UTF-8 and comparing them directly. Your comparison has to take the encoding into account to know whether to compare 8 bits, 16 bits or more. If you can somehow turn the raw data into a null-terminated string, then the comparison is trivially easy using regular string functions. This string may be more than one byte/octet long, but it will represent a single character/code point.
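As a rough illustration of the string approach, the sketch below encodes a code point into a null-terminated UTF-8 string and compares two such strings with strcmp. The names utf8_encode and utf8_codepoint_cmp are hypothetical. It relies on two facts: strcmp compares bytes as unsigned char, and the byte order of well-formed UTF-8 matches code point order. Code point 0 encodes to an empty string here, so it is an edge case this sketch does not handle specially.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode code point cp as a null-terminated UTF-8 string in out
 * (out must have room for 5 bytes). Returns the number of bytes
 * written, not counting the terminator, or 0 if cp is not encodable.
 * Hypothetical helper for illustration. */
size_t utf8_encode(uint32_t cp, char out[5])
{
    size_t n;

    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {               /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {             /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not valid code points */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {           /* 4 bytes: 11110xxx then three 10xxxxxx */
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;                          /* beyond the Unicode range */
    }
    out[n] = '\0';
    return n;
}

/* Compare two code points by encoding each as a 1-character UTF-8
 * string and using the regular string comparison. */
int utf8_codepoint_cmp(uint32_t a, uint32_t b)
{
    char sa[5] = "", sb[5] = "";

    utf8_encode(a, sa);
    utf8_encode(b, sb);
    return strcmp(sa, sb);
}
```

With this, 0x20AC (€) encodes to the three bytes E2 82 AC followed by a terminator, which is the "3 bytes stored into 3 chars" the question asks about.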
Windows is a bit of a special case. Wide characters are short int (16-bit). Historically this meant UCS-2 but it has been redefined as UTF-16. This means that all valid characters in the Basic Multilingual Plane (BMP) can be compared directly, since they will occupy a single short int, but others cannot. I am not aware of any simple way to deal with 32-bit wide characters (represented as a simple int) outside the BMP on Windows.
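One possible manual workaround, offered only as a sketch, is to combine a UTF-16 surrogate pair into a single 32-bit code point yourself and then compare the integers. The function name utf16_decode is hypothetical and no Windows API is used; the input is assumed to be an array of 16-bit code units.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from n UTF-16 code units starting at s.
 * Returns the number of units consumed (1 or 2), or 0 on an
 * unpaired surrogate or empty input. Hypothetical helper. */
size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
    if (n == 0)
        return 0;

    if (s[0] >= 0xD800 && s[0] <= 0xDBFF) {        /* high surrogate */
        if (n < 2 || s[1] < 0xDC00 || s[1] > 0xDFFF)
            return 0;                              /* missing low surrogate */
        /* 10 bits from each half, offset by 0x10000 */
        *cp = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10)
                         | (uint32_t)(s[1] - 0xDC00));
        return 2;
    }
    if (s[0] >= 0xDC00 && s[0] <= 0xDFFF)
        return 0;                                  /* unpaired low surrogate */

    *cp = s[0];                                    /* BMP character: one unit */
    return 1;
}
```

A BMP character such as € (0x20AC) is returned unchanged from one unit, while the pair 0xD83D 0xDE00 combines to 0x1F600.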