How does decoding in UTF-8 know the byte boundaries?
Question
I've been doing a bunch of reading on Unicode encodings, especially with regard to Python. I think I have a pretty strong understanding of it now, but there's still one small detail I'm a little unsure about.
How does the decoding know the byte boundaries? For example, say I have a Unicode string with two Unicode characters whose byte representations are \xc6\xb4 and \xe2\x98\x82, respectively. I then write this Unicode string to a file, so the file now contains the bytes \xc6\xb4\xe2\x98\x82. Now I decide to open and read the file (and Python defaults to decoding the file as utf-8), which leads me to my main question.
How does the decoding know to interpret the bytes as \xc6\xb4 and not as \xc6\xb4\xe2?
Recommended answer
The byte boundaries are easily determined from the bit patterns. In your case, \xc6 starts with the bits 1100, and \xe2 starts with 1110. In UTF-8 (and I'm pretty sure this is not an accident), you can determine the number of bytes in the whole character by looking only at the first byte and counting the number of 1 bits at the start before the first 0. So your first character has 2 bytes and the second one has 3 bytes.
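As a rough illustration of the counting rule, here is a minimal Python sketch. The function name utf8_sequence_length is made up for this example and is not part of Python's standard library; it only inspects the lead byte and does not validate the bytes that follow.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence has, judging only by its first byte.

    Hypothetical helper for illustration; it does not check the continuation bytes.
    """
    if lead_byte & 0b1000_0000 == 0:
        return 1  # 0xxxxxxx: a single-byte (ASCII) character

    # Count the 1 bits at the start of the byte, before the first 0.
    count = 0
    mask = 0b1000_0000
    while lead_byte & mask:
        count += 1
        mask >>= 1

    # One leading 1 (10xxxxxx) marks a continuation byte, not a lead byte,
    # and UTF-8 sequences are at most 4 bytes long.
    if count == 1 or count > 4:
        raise ValueError(f"0x{lead_byte:02x} is not a valid UTF-8 lead byte")
    return count


data = b"\xc6\xb4\xe2\x98\x82"
first_len = utf8_sequence_length(data[0])           # lead byte 0xc6
second_len = utf8_sequence_length(data[first_len])  # lead byte 0xe2
print(format(data[0], "08b"), first_len)
print(format(data[first_len], "08b"), second_len)
```

Indexing a bytes object in Python 3 yields an int, which is why the helper takes an integer rather than a one-byte string.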
If a byte starts with 0, it is a regular ASCII character.
If a byte starts with 10, it is a continuation byte: part of a multi-byte UTF-8 sequence, but never its first byte.
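Because continuation bytes are always recognizable by their 10 prefix, a byte stream can be cut into characters without any separator. The sketch below shows the idea; split_utf8 is a hypothetical name, the code assumes well-formed input, and it is not meant to describe how Python's real decoder is implemented.

```python
def split_utf8(data):
    """Split UTF-8 bytes into one chunk per character.

    Illustrative only: assumes the input is well-formed UTF-8.
    """
    chunks = []
    for byte in data:
        is_continuation = byte & 0b1100_0000 == 0b1000_0000  # 10xxxxxx
        if is_continuation and chunks:
            chunks[-1] += bytes([byte])   # belongs to the current character
        else:
            chunks.append(bytes([byte]))  # 0xxxxxxx or 11xxxxxx starts a new one
    return chunks


parts = split_utf8(b"\xc6\xb4\xe2\x98\x82")
print(parts)
print([part.decode("utf-8") for part in parts])
```

For the bytes in the question, this yields the two chunks \xc6\xb4 and \xe2\x98\x82, so \xe2 can never be mistaken for the tail of the first character.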