UTF-8 Continuation bytes
Question
I'm trying to figure out what "continuation bytes" are (for curiosity's sake) in the UTF-8 encoding.
Wikipedia introduces this term in the UTF-8 article without defining it at all.
A Google search returns no useful information either. I'm about to jump into the official specification, but would prefer to read a high-level summary first.
Recommended answer
A continuation byte in UTF-8 is any byte where the top two bits are 10.
They are the subsequent bytes in multi-byte sequences. The following table may help:
Unicode code points  Encoding  Binary value
-------------------  --------  --------------------------
U+000000-U+00007f    0xxxxxxx  0xxxxxxx

U+000080-U+0007ff    110yyyxx  00000yyy xxxxxxxx
                     10xxxxxx

U+000800-U+00ffff    1110yyyy  yyyyyyyy xxxxxxxx
                     10yyyyxx
                     10xxxxxx

U+010000-U+10ffff    11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                     10zzyyyy
                     10yyyyxx
                     10xxxxxx
Here you can see how the Unicode code points map to UTF-8 multi-byte sequences, and their equivalent binary values.
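To make the table concrete, here is a minimal sketch in C of an encoder that follows its four rows. The function name utf8_encode and the 4-byte output buffer are assumptions made for illustration. The rejection of surrogates (U+D800-U+DFFF) is not shown in the table; it is added here because well-formed UTF-8 does not encode them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* row 1: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* row 2: 110yyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* row 3: 1110yyyy 10yyyyxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* row 4: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond U+10FFFF */
}
```

For example, U+20AC (the euro sign) falls in the third row and encodes to the three bytes E2 82 AC; the last two both have 10 as their top two bits, so they are continuation bytes.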
The basic rules are:
- If a byte starts with a 0 bit, it's a single-byte value less than 128.
- If it starts with 11, it's the first byte of a multi-byte sequence, and the number of leading 1 bits indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
- If it starts with 10, it's a continuation byte.
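These rules can be sketched as two small C helpers. The names utf8_is_continuation and utf8_sequence_length are hypothetical, and the sketch only inspects the leading bit pattern; it does not reject lead bytes that match a pattern but never occur in well-formed UTF-8 (such as 0xC0 or 0xF5).

```c
#include <stddef.h>

/* True when the top two bits are 10, i.e. the byte is a continuation byte. */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Length of the sequence announced by a first byte, judged only by its
   leading bits. Returns 0 for a continuation byte or for a pattern with
   five or more leading 1 bits. */
size_t utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}
```

Masking with 0xC0 keeps just the top two bits, so comparing the result against 0x80 is the whole continuation-byte test.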
This distinction allows some quite handy processing, such as being able to back up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find a byte that does not begin with the bits 10.
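A minimal sketch of that backward search in C, assuming the text is held in a byte buffer and pos is a valid index into it (the name utf8_find_start is hypothetical):

```c
#include <stddef.h>

/* Step back from buf[pos] over continuation bytes (10xxxxxx) and return the
   index of the first byte of the code point that contains buf[pos].
   Stops at index 0 even if the input is malformed. */
size_t utf8_find_start(const unsigned char *buf, size_t pos)
{
    while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

In valid UTF-8 the loop steps back at most three times, because a sequence has at most three continuation bytes.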
Similarly, it can also be used for a UTF-8 strlen, by counting only the bytes that do not match 10xxxxxx.
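One way such a function might look in C, as a sketch that assumes a NUL-terminated, already valid UTF-8 string (the name utf8_strlen is hypothetical, and no validation is performed):

```c
#include <stddef.h>

/* Count code points by counting every byte that is not a continuation
   byte, since each code point contributes exactly one such byte. */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Note that this counts code points, not bytes and not user-perceived characters: a base letter followed by a combining accent counts as two.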