How does decoding in UTF-8 know the byte boundaries?

Question

I've been doing a bunch of reading on Unicode encodings, especially with regard to Python. I think I have a pretty strong understanding of it now, but there's still one small detail I'm a little unsure about.

How does the decoding know the byte boundaries? For example, say I have a Unicode string with two characters whose byte representations are \xc6\xb4 and \xe2\x98\x82, respectively. I then write this string to a file, so the file now contains the bytes \xc6\xb4\xe2\x98\x82. Now I decide to open and read the file (and Python defaults to decoding the file as utf-8), which leads me to my main question.

How does the decoding know to interpret the bytes \xc6\xb4 as one character, and not \xc6\xb4\xe2?

Recommended answer

The byte boundaries are easily determined from the bit patterns. In your case, \xc6 starts with the bits 1100, and \xe2 starts with 1110. In UTF-8 (and I'm pretty sure this is not an accident), you can determine the number of bytes in a multi-byte character by looking only at its first byte and counting the number of 1 bits at the start, before the first 0. So your first character has 2 bytes and the second one has 3 bytes.
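As a rough illustration of that counting rule, here is a minimal Python sketch. The helper name leading_ones is made up for this example and is not part of any library; it simply counts the 1 bits at the top of a byte until it hits the first 0.

```python
def leading_ones(byte: int) -> int:
    """Count the 1 bits at the top of an 8-bit value, before the first 0 bit.

    Hypothetical helper for illustration only.
    """
    count = 0
    mask = 0x80  # start at the most significant bit
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


# Iterating over a bytes object yields integers in range(256).
for b in b"\xc6\xb4\xe2\x98\x82":
    print(f"{b:#04x}  {b:08b}  leading ones: {leading_ones(b)}")
```

For the bytes in the question, \xc6 gives 2 and \xe2 gives 3, which matches the lengths of the two characters. The remaining bytes each give 1.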

If a byte starts with 0, it is a regular ASCII character, encoded in a single byte.

If a byte starts with 10, it is a continuation byte: part of a multi-byte UTF-8 sequence, but not its first byte.
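Putting the three rules together, a decoder can walk the byte stream and cut it at each lead byte. The sketch below is only an illustration of that idea, not how Python's own decoder is implemented; split_utf8 is a hypothetical name, and the function deliberately skips the stricter checks a real decoder performs (overlong encodings, surrogates, code points above U+10FFFF).

```python
def split_utf8(data: bytes) -> list:
    """Split UTF-8 bytes into one bytes object per character.

    Hypothetical helper for illustration; it checks only the bit prefixes.
    """
    chunks = []
    i = 0
    while i < len(data):
        first = data[i]
        if first >> 7 == 0b0:          # 0xxxxxxx: single-byte ASCII
            length = 1
        elif first >> 5 == 0b110:      # 110xxxxx: lead byte of 2
            length = 2
        elif first >> 4 == 0b1110:     # 1110xxxx: lead byte of 3
            length = 3
        elif first >> 3 == 0b11110:    # 11110xxx: lead byte of 4
            length = 4
        else:                          # 10xxxxxx or 11111xxx cannot start a character
            raise ValueError(f"byte {first:#04x} at offset {i} is not a lead byte")

        chunk = data[i:i + length]
        # Every byte after the lead byte must look like 10xxxxxx.
        if len(chunk) != length or any(b >> 6 != 0b10 for b in chunk[1:]):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        chunks.append(chunk)
        i += length
    return chunks


chunks = split_utf8(b"\xc6\xb4\xe2\x98\x82")
print(chunks)
print([c.decode("utf-8") for c in chunks])
```

Applied to the bytes from the question, this yields \xc6\xb4 and \xe2\x98\x82 as separate chunks, so \xe2 is never absorbed into the first character: its 1110 prefix marks it as the start of a new sequence.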
