How does UTF-8 encoding identify single-byte and double-byte characters?


Question

Recently I ran into a character-encoding issue, and while digging into character sets and character encodings this question came to mind. UTF-8 is the most popular encoding because of its backward compatibility with ASCII. Since UTF-8 is a variable-length encoding, how does it differentiate single-byte and double-byte characters? For example, "Aݔ" is stored as "410754" (the Unicode for A is 41 and the Unicode for the Arabic character is 0754). How does the encoding identify that 41 is one character and 0754 is another, two-byte character? Why isn't it read as 4107 as one double-byte character and 54 as a single-byte character?

Recommended answer

For example, "Aݔ" is stored as "410754"

That's not how UTF-8 works.

Characters U+0000 through U+007F (aka ASCII) are stored as single bytes. They are the only characters whose code points numerically match their UTF-8 representation. For example, U+0041 becomes 0x41, which is 01000001 in binary.

All other characters are represented with multiple bytes. U+0080 through U+07FF use two bytes each, U+0800 through U+FFFF use three bytes each, and U+10000 through U+10FFFF use four bytes each.
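The ranges above can be written as a small lookup. The following is a minimal Python sketch, not part of the original answer; the function name `utf8_length` is hypothetical, and the sketch ignores the surrogate range U+D800 through U+DFFF, which a strict encoder would reject.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for the given code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:      # ASCII range
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# Cross-check against Python's built-in encoder for one character per range.
for ch in ("A", "\u0754", "\u20ac", "\U0001F600"):
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
```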

Computers know where one character ends and the next one starts because UTF-8 was designed so that the single-byte values used for ASCII do not overlap with those used in multi-byte sequences. The bytes 0x00 through 0x7F are only used for ASCII and nothing else; the bytes above 0x7F are only used for multi-byte sequences and nothing else. Furthermore, the bytes that are used at the beginning of the multi-byte sequences cannot occur in any other position in those sequences.
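Because the three roles never overlap, the top bits of a byte are enough to tell which role it plays. A rough Python sketch under that assumption follows; the name `byte_kind` is hypothetical, and the sketch only looks at the top bits, so it does not reject values such as 0xFF that never appear in valid UTF-8.

```python
def byte_kind(b: int) -> str:
    """Classify a single byte value (0..255) by its role in UTF-8."""
    if b <= 0x7F:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if b <= 0xBF:
        return "continuation"   # 10xxxxxx: second or later byte of a sequence
    return "lead"               # 11xxxxxx: first byte of a multi-byte sequence


# The string from the question: "A" followed by U+0754.
kinds = [byte_kind(b) for b in "A\u0754".encode("utf-8")]
assert kinds == ["ascii", "lead", "continuation"]
```

A reader that lands in the middle of a stream can skip continuation bytes until it sees an ASCII or lead byte, which is the start of the next character.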

Because of that, code points outside the ASCII range cannot be stored as raw numbers; they need to be encoded. Consider the following binary patterns:

  • 2 bytes: 110xxxxx 10xxxxxx
  • 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  • 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The number of leading ones in the first byte tells you how many bytes the sequence has in total, so a first byte of 110xxxxx is followed by one more byte, 1110xxxx by two, and 11110xxx by three. Every following (continuation) byte starts with 10 in binary. To encode a character, you convert its code point to binary and fill in the x's, padding with leading zeros if it has fewer digits than there are x's.
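Reading a character back reverses that process: inspect the first byte to learn the sequence length, then collect the x bits. Here is a minimal Python sketch of the idea; `decode_first` is a hypothetical helper, and it deliberately skips checks a strict decoder performs, such as rejecting overlong encodings and surrogates.

```python
from typing import Tuple


def decode_first(data: bytes) -> Tuple[int, int]:
    """Decode the first character in data; return (code_point, bytes_used)."""
    first = data[0]
    if first < 0x80:                      # 0xxxxxxx: single ASCII byte
        return first, 1
    if first >> 5 == 0b110:               # 110xxxxx: two-byte sequence
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:            # 1110xxxx: three-byte sequence
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:           # 11110xxx: four-byte sequence
        length, value = 4, first & 0x07
    else:
        raise ValueError("not a valid first byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for b in data[1:length]:
        if b >> 6 != 0b10:                # continuation bytes are 10xxxxxx
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (b & 0x3F)  # append the six payload bits
    return value, length


# "A" followed by U+0754 is the three bytes 41 DD 94.
assert decode_first(b"\x41\xdd\x94") == (0x41, 1)
assert decode_first(b"\xdd\x94") == (0x0754, 2)
```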

As an example: U+0754 is between U+0080 and U+07FF, so it needs two bytes. 0x0754 in binary is 11101010100 (11 digits, exactly the number of x's in the two-byte pattern), so you replace the x's with those digits:

11011101 10010100
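In hexadecimal those two bytes are 0xDD 0x94, so "Aݔ" is stored as 41 DD 94 rather than 41 07 54. The same bit-filling can be checked with a short Python sketch; `encode_two_byte` is a hypothetical helper that covers only the two-byte range.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte range")
    first = 0b11000000 | (code_point >> 6)         # 110xxxxx: top five bits
    second = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: low six bits
    return bytes([first, second])


encoded = encode_two_byte(0x0754)
print(" ".join(f"{b:08b}" for b in encoded))  # 11011101 10010100

# Agrees with Python's built-in encoder.
assert encoded == "\u0754".encode("utf-8") == b"\xdd\x94"
```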
