UTF-8 & Unicode, what's with 0xC0 and 0x80?

Question

I've been reading about Unicode and UTF-8 for the last couple of days, and I often come across a bitwise comparison similar to this:

int strlen_utf8(char *s) 
{
  int i = 0, j = 0;
  while (s[i]) 
  {
    if ((s[i] & 0xc0) != 0x80) j++;
    i++;
  }
  return j;
}

Can someone clarify the comparison with 0xc0, and whether it is checking the most significant bit?

Thanks!

ANDed, not comparison, used the wrong word ;)

Recommended answer

It's not a comparison with 0xc0, it's a bitwise AND operation with 0xc0.

The bit mask 0xc0 is 11 00 00 00 so what the AND is doing is extracting only the top two bits:

    ab cd ef gh
AND 11 00 00 00
    -- -- -- --
  = ab 00 00 00

This is then compared to 0x80 (binary 10 00 00 00). In other words, the if statement is checking to see if the top two bits of the value are not equal to 10.
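
As a small sketch of that test in isolation (the helper name is_continuation_byte is made up for illustration and does not appear in the question):

```c
#include <stdbool.h>

/* Hypothetical helper: true when the top two bits of the byte are 10. */
bool is_continuation_byte(unsigned char b)
{
    /* 0xc0 = 11000000 keeps only the top two bits;
       0x80 = 10000000 is the pattern being tested for. */
    return (b & 0xc0) == 0x80;
}
```

For instance, 0xA9 (10101001) passes the test, while 0x41 (01000001) and 0xC3 (11000011) do not.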

"Why?", I hear you ask. Well, that's a good question. The answer is that, in UTF-8, all bytes that begin with the bit pattern 10 are subsequent bytes of a multi-byte sequence:

                    UTF-8
Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx

U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx
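
To make the table concrete, here is a rough encoder that follows it row by row. The function name utf8_encode and its signature are my own choices for illustration, and as a simplifying assumption it does not reject surrogate code points (U+D800-U+DFFF), which a strict encoder would:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative encoder following the table above. Writes up to 4 bytes
   to out and returns how many were written, or 0 if cp is above U+10FFFF.
   Assumption: surrogates (U+D800-U+DFFF) are NOT rejected here. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7f) {                 /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7ff) {                /* 110yyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp <= 0xffff) {               /* 1110yyyy 10yyyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp <= 0x10ffff) {             /* 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xf0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;                         /* outside the Unicode range */
}
```

Note that every byte after the first is built as 0x80 | (six bits), which is exactly why the (byte & 0xc0) == 0x80 test identifies them.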

So, what this little snippet is doing is going through every byte of your UTF-8 string and counting all the bytes that aren't continuation bytes (i.e., it's getting the length of the string in characters (code points) rather than in bytes, as advertised). See the Wikipedia page on UTF-8 for more detail and Joel Spolsky's excellent article for a primer.
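
To see the difference between the byte length and the character count, here is a lightly reworked variant of the question's function. The name utf8_count, the const qualifier, and the size_t return type are my changes, not part of the original, and it assumes the input is valid, NUL-terminated UTF-8:

```c
#include <stddef.h>

/* Variant of the question's strlen_utf8: counts code points by skipping
   continuation bytes. Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_count(const char *s)
{
    size_t count = 0;
    for (; *s; s++) {
        /* Cast so the mask is applied to the raw byte value even when
           plain char is signed. */
        if (((unsigned char)*s & 0xc0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "h\xc3\xa9llo" (the UTF-8 bytes of "héllo"), strlen reports 6 bytes while utf8_count reports 5 characters, because the 0xA9 continuation byte is skipped.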

An interesting aside, by the way: you can classify bytes in a UTF-8 stream as follows:

  • With the high bit set to 0, it's a single byte value.
  • With the two high bits set to 10, it's a continuation byte.
  • Otherwise, it's the first byte of a multi-byte sequence and the number of leading 1 bits indicates how many bytes there are in total for this sequence (110... means two bytes, 1110... means three bytes, etc).
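
Those three rules can be sketched as a small classifier. The function name utf8_byte_class and its return-value convention are invented for this example. It only looks at the leading bit pattern, so bytes that match a lead-byte pattern but are still not valid in UTF-8 (such as 0xC0 and 0xC1) are not filtered out:

```c
/* Hypothetical classifier based on the three rules above.
   Returns 1 for a single-byte value, 0 for a continuation byte,
   2..4 for the first byte of a multi-byte sequence, and -1 for a
   byte with five or more leading 1 bits, which UTF-8 never uses. */
int utf8_byte_class(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((b & 0xc0) == 0x80) return 0;  /* 10xxxxxx */
    if ((b & 0xe0) == 0xc0) return 2;  /* 110xxxxx */
    if ((b & 0xf0) == 0xe0) return 3;  /* 1110xxxx */
    if ((b & 0xf8) == 0xf0) return 4;  /* 11110xxx */
    return -1;                         /* 11111xxx */
}
```

Each test widens the mask by one bit, so the first check that matches tells you how many leading 1 bits the byte has.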
