How does UTF-8 "variable-width encoding" work?

Question

The Unicode standard defines enough code points that a fixed-width encoding needs 4 bytes for each one. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes most text into much less space by using something called "variable-width encoding".

In fact, it manages to represent all 128 US-ASCII characters in just one byte each, and those bytes look exactly like real ASCII, so you can interpret any pure ASCII text as if it were UTF-8 without doing anything to it. Neat trick. So how does it work?
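To make the size difference and the ASCII compatibility concrete, here is a small Python sketch. The sample characters are my own picks for illustration, not anything from a spec; the sketch only relies on the standard `str.encode` method.

```python
# Sample characters chosen for illustration: one for each UTF-8 length.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf32 = ch.encode("utf-32-be")  # big-endian variant, so no byte-order mark is added
    print(f"U+{ord(ch):04X}  utf-8: {len(utf8)} byte(s)  utf-32: {len(utf32)} bytes")

# Pure ASCII text comes out byte-for-byte identical in ASCII and UTF-8.
text = "plain ascii text"
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)
```

UTF-32 spends 4 bytes on every character, while UTF-8 spends between 1 and 4 depending on the code point.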

I'm going to ask and answer my own question here because I just did a bit of reading to figure it out and I thought it might save somebody else some time. Plus maybe somebody can correct me if I've got some of it wrong.

Recommended answer

Each byte starts with a few bits that tell you whether it's a single-byte code point, the first byte of a multi-byte code point, or a continuation of a multi-byte code point. Like this:

0xxx xxxx    A single-byte US-ASCII code (one of the 128 code points 0 to 127)

The multi-byte code points each start with a few bits that essentially say "hey, you also need to read the next byte (or two, or three) to figure out what I am." They are:

110x xxxx    One more byte follows
1110 xxxx    Two more bytes follow
1111 0xxx    Three more bytes follow
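As a rough sketch of how a decoder might read those prefixes, the function below maps a first byte to the total length of its sequence. The name `sequence_length` is made up for this example, and it only checks the bit patterns listed above; a strict decoder also rejects a few byte values that happen to match these patterns.

```python
def sequence_length(lead: int) -> int:
    """Return the total number of bytes in the sequence that this byte starts.

    Raises ValueError if the byte matches none of the start patterns.
    """
    if (lead & 0b1000_0000) == 0b0000_0000:  # 0xxx xxxx
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:  # 110x xxxx
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:  # 1110 xxxx
        return 3
    if (lead & 0b1111_1000) == 0b1111_0000:  # 1111 0xxx
        return 4
    raise ValueError(f"0x{lead:02X} does not start a sequence")


# Check against the first byte Python itself produces for a 3-byte character.
euro = "\u20ac".encode("utf-8")
print(sequence_length(euro[0]), len(euro))
```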

Finally, the bytes that follow those start codes all look like this:

10xx xxxx    A continuation of one of the multi-byte characters
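Putting the start and continuation patterns together, here is a minimal hand-rolled encoder, written only to show where the `x` bits go. The function name `encode_code_point` is hypothetical, and the sketch assumes the input is not a surrogate (U+D800 to U+DFFF), which real encoders refuse. In practice you would just call `str.encode("utf-8")`.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point as UTF-8 by filling the x bits of each pattern."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:        # up to 7 bits:  0xxx xxxx
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: 110x xxxx, 10xx xxxx
        return bytes([0b1100_0000 | (cp >> 6),
                      0b1000_0000 | (cp & 0b11_1111)])
    if cp < 0x10000:     # up to 16 bits: 1110 xxxx, then two continuation bytes
        return bytes([0b1110_0000 | (cp >> 12),
                      0b1000_0000 | ((cp >> 6) & 0b11_1111),
                      0b1000_0000 | (cp & 0b11_1111)])
    if cp <= 0x10FFFF:   # up to 21 bits: 1111 0xxx, then three continuation bytes
        return bytes([0b1111_0000 | (cp >> 18),
                      0b1000_0000 | ((cp >> 12) & 0b11_1111),
                      0b1000_0000 | ((cp >> 6) & 0b11_1111),
                      0b1000_0000 | (cp & 0b11_1111)])
    raise ValueError("code point is beyond U+10FFFF")


# Compare the sketch with Python's built-in encoder on a few code points.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", encode_code_point(cp).hex(" "),
          encode_code_point(cp) == chr(cp).encode("utf-8"))
```

Each continuation byte carries 6 payload bits, which is why the ranges grow from 7 to 11 to 16 to 21 bits.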

Since you can tell what kind of byte you're looking at from its first few bits, even if something gets mangled somewhere, you don't lose the whole sequence: a decoder can skip ahead to the next byte that isn't a continuation byte and carry on from there.
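Here is a small sketch of that recovery, assuming the damage is a single lost start byte. The helper name `next_boundary` is invented for this example; the rest uses the standard `bytes.decode` with its `errors="replace"` handler.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that is not a continuation byte."""
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos


data = "a\u20acb".encode("utf-8")   # 'a', a 3-byte euro sign, then 'b'
damaged = data[:1] + data[2:]       # simulate losing the euro sign's start byte

# Position 1 now holds an orphaned continuation byte; skip to the next character start.
resume = next_boundary(damaged, 1)
print(damaged[resume:].decode("utf-8"))

# Python's own decoder recovers the same way when asked to replace bad bytes.
recovered = damaged.decode("utf-8", errors="replace")
print(recovered)
```

Only the damaged character is lost; the text on either side of it still decodes.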
