UTF-8 continuation bytes


Question

Out of curiosity, I'm trying to figure out what "continuation bytes" are in the UTF-8 encoding.

Wikipedia introduces this term in its UTF-8 article without defining it at all.

A Google search returns no useful information either. I'm about to jump into the official specification, but would prefer to read a high-level summary first.

Recommended answer

A continuation byte in UTF-8 is any byte whose top two bits are 10.

Continuation bytes are the second and later bytes of a multi-byte sequence. The following table may help:

Unicode code points  Encoding  Binary value
-------------------  --------  ------------
 U+000000-U+00007f   0xxxxxxx  0xxxxxxx

 U+000080-U+0007ff   110yyyxx  00000yyy xxxxxxxx
                     10xxxxxx

 U+000800-U+00ffff   1110yyyy  yyyyyyyy xxxxxxxx
                     10yyyyxx
                     10xxxxxx

 U+010000-U+10ffff   11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                     10zzyyyy
                     10yyyyxx
                     10xxxxxx

Here you can see how Unicode code points map to UTF-8 byte sequences, and how the x, y and z bits of each sequence line up with the code point's binary value.
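To make the table concrete, here is a minimal sketch that packs a code point into bytes using the same bit layout. The function name `encode_utf8` is made up for this illustration, and the rejection of surrogates (U+D800 to U+DFFF) is an addition not shown in the table; in real code you would simply call `str.encode("utf-8")`.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by hand, following the table's bit layout.

    Hypothetical helper for illustration only.
    """
    # Assumption beyond the table: surrogates and out-of-range values are rejected.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

Every byte after the first is built as `0x80 | (six bits)`, which is exactly what gives continuation bytes their 10 prefix.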

The basic rules are these:

  1. If a byte starts with a 0 bit, it's a single-byte value less than 128.
  2. If it starts with 11, it's the first byte of a multi-byte sequence, and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
  3. If it starts with 10, it's a continuation byte.
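The three rules translate directly into bit masks. The sketch below is one way to write them down; the function name `classify` and the labels it returns are invented for this example and are not part of any standard API.

```python
def classify(byte: int) -> str:
    """Label a single byte according to its leading bits (illustrative helper)."""
    if (byte & 0x80) == 0x00:   # 0xxxxxxx
        return "single"
    if (byte & 0xC0) == 0x80:   # 10xxxxxx
        return "continuation"
    if (byte & 0xE0) == 0xC0:   # 110xxxxx
        return "lead-2"
    if (byte & 0xF0) == 0xE0:   # 1110xxxx
        return "lead-3"
    if (byte & 0xF8) == 0xF0:   # 11110xxx
        return "lead-4"
    return "invalid"            # 11111xxx never appears in UTF-8


# Example: label each byte of a short string.
labels = [classify(b) for b in "a\u20ac".encode("utf-8")]
```

For the string "a€", `labels` holds one single-byte entry followed by a three-byte lead and two continuation bytes.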

This distinction makes some processing very convenient, such as backing up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find a byte that does not begin with the bits 10.
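A minimal sketch of that backward search, assuming the input is valid UTF-8 and using a made-up function name, `code_point_start`:

```python
def code_point_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the code point containing data[index].

    Illustrative helper; assumes `data` is well-formed UTF-8.
    """
    # Step back while the current byte looks like 10xxxxxx.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


data = "a\u20acb".encode("utf-8")      # bytes: 61 E2 82 AC 62
start = code_point_start(data, 3)      # index 3 is the last byte of the euro sign
```

Since a UTF-8 sequence is at most four bytes long, the loop steps back at most three times on valid input.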

Similarly, it can be used to write a UTF-8 strlen that counts code points rather than bytes: count only the bytes that are not of the form 10xxxxxx.
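As a rough sketch, the counting idea looks like this. The name `utf8_len` is hypothetical, and the function assumes valid UTF-8; on malformed input it still returns a number, but not a meaningful one.

```python
def utf8_len(data: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx).

    Illustrative helper; assumes `data` is well-formed UTF-8.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


encoded = "h\u00e9llo \u20ac".encode("utf-8")
count = utf8_len(encoded)
```

Here `encoded` is 10 bytes long while `count` matches the 7 code points of the original string. Note that this counts code points, not user-perceived characters, so a base letter followed by a combining mark counts as two.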
