How does UTF-8 encoding identify single byte and double byte characters?


Question



Recently I've faced an issue regarding character encoding, and while digging into character sets and character encodings this doubt came to my mind: UTF-8 is the most popular encoding because of its backward compatibility with ASCII. Since UTF-8 is a variable-length encoding format, how does it differentiate single-byte and double-byte characters? For example, "Aݔ" is stored as "410754" (the code point of A is 41, and that of the Arabic character is 0754). How does the encoding identify that 41 is one character and 0754 is another, two-byte character? Why is it not read as 4107 being one double-byte character and 54 a single-byte character?

Solution

For example, "Aݔ" is stored as "410754"

That's not how UTF-8 works.

Characters U+0000 through U+007F (aka ASCII) are stored as single bytes. They are the only characters whose code points numerically match their UTF-8 representation. For example, U+0041 becomes the single byte 0x41, which is 01000001 in binary.

All other characters are represented with multiple bytes. U+0080 through U+07FF use two bytes each, U+0800 through U+FFFF use three bytes each, and U+10000 through U+10FFFF use four bytes each.
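These ranges can be checked directly in Python, whose built-in `str.encode` produces UTF-8 (a quick sketch, not part of the original answer; the sample code points are chosen at the boundaries of each range):

```python
# Byte count of the UTF-8 encoding at the boundary of each range
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    n = len(chr(cp).encode("utf-8"))
    print(f"U+{cp:06X} uses {n} byte(s)")
```

Note that the surrogate range U+D800 through U+DFFF is excluded: those code points are reserved for UTF-16 and cannot be encoded as UTF-8 at all.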

Computers know where one character ends and the next one starts because UTF-8 was designed so that the single-byte values used for ASCII do not overlap with those used in multi-byte sequences. The bytes 0x00 through 0x7F are only used for ASCII and nothing else; the bytes above 0x7F are only used for multi-byte sequences and nothing else. Furthermore, the bytes that are used at the beginning of the multi-byte sequences also cannot occur in any other position in those sequences.
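This non-overlap can be made visible by classifying each byte of the example string by its high bits (a Python sketch; the range boundaries follow from the bit patterns discussed in this answer):

```python
# Classify each byte of "Aݔ" (UTF-8: 0x41 0xDD 0x94) by its high bits
data = "A\u0754".encode("utf-8")
for b in data:
    if b < 0x80:
        kind = "ASCII, a complete single-byte character"
    elif b < 0xC0:
        kind = "continuation byte (10xxxxxx), never starts a character"
    elif b < 0xE0:
        kind = "lead byte of a two-byte sequence (110xxxxx)"
    elif b < 0xF0:
        kind = "lead byte of a three-byte sequence (1110xxxx)"
    else:
        kind = "lead byte of a four-byte sequence (11110xxx)"
    print(f"0x{b:02X} = {b:08b}: {kind}")
```

A decoder that lands in the middle of a stream can therefore skip at most three continuation bytes before finding the start of the next character, which is why UTF-8 is called self-synchronizing.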

Because of that, the code points need to be encoded. Consider the following binary patterns:

  • 2 bytes: 110xxxxx 10xxxxxx
  • 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  • 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The number of leading ones in the first byte tells you how many bytes the whole sequence occupies. Every continuation byte of the sequence starts with 10 in binary. To encode a character, you convert its code point to binary and fill in the x's.
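The fill-in procedure for the two-byte pattern can be sketched like this (the function name is made up for illustration; it is not a standard API):

```python
def encode_two_bytes(cp: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx."""
    assert 0x80 <= cp <= 0x7FF
    first = 0b11000000 | (cp >> 6)          # 110 marker + top 5 bits
    second = 0b10000000 | (cp & 0b111111)   # 10 marker + low 6 bits
    return bytes([first, second])

print(encode_two_bytes(0x0754).hex())  # prints dd94
```

The result agrees with what Python's own codec produces for the same character: `"\u0754".encode("utf-8")` is `b"\xdd\x94"`.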

As an example: U+0754 is between U+0080 and U+07FF, so it needs two bytes. 0x0754 in binary is 11101010100, so you replace the x's with those digits:

11011101 10010100
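That worked example can be confirmed with Python's built-in UTF-8 codec (a quick check, not part of the original answer):

```python
# 11011101 10010100 is 0xDD 0x94; decoding it yields U+0754,
# and the full example string round-trips as 0x41 0xDD 0x94
assert bytes([0b11011101, 0b10010100]) == b"\xdd\x94"
assert b"\xdd\x94".decode("utf-8") == "\u0754"
assert "A\u0754".encode("utf-8") == b"A\xdd\x94"
print("ok")
```

So the stored bytes are 41 DD 94, not "410754": the second and third bytes carry the 110/10 markers, which is exactly what prevents a decoder from misreading them as 4107 plus 54.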
