How does UTF-8 "variable-width encoding" work?

Question

The Unicode standard has enough code points in it that a fixed-width encoding needs 4 bytes to store any of them. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes these into much smaller spaces by using something called "variable-width encoding".

In fact, it manages to represent each of the 128 characters of US-ASCII in just one byte that looks exactly like real ASCII, so you can interpret lots of ASCII text as if it were UTF-8 without doing anything to it. Neat trick. So how does it work?

I'm going to ask and answer my own question here because I just did a bit of reading to figure it out and I thought it might save somebody else some time. Plus maybe somebody can correct me if I've got some of it wrong.

Recommended answer

Each byte starts with a few bits that tell you whether it's a single-byte code point, the start of a multi-byte code point, or a continuation of a multi-byte code point. Like this:

0xxx xxxx    A single-byte US-ASCII code (one of the 128 ASCII characters)

The multi-byte code points each start with a few bits that essentially say "hey, you need to also read the next byte (or two, or three) to figure out what I am." They are:

110x xxxx    One more byte follows
1110 xxxx    Two more bytes follow
1111 0xxx    Three more bytes follow
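As a rough illustration of the patterns so far, here is a minimal Python sketch that reads only the first byte of a sequence and reports how many bytes the whole sequence should take. The function name `sequence_length` is made up for this example. It checks the bit patterns only, so it will accept a few start bytes that a strict UTF-8 decoder rejects (for instance 0xC0, 0xC1, and 0xF5 through 0xF7).

```python
def sequence_length(lead_byte: int) -> int:
    """Return the total byte count of the sequence that starts with lead_byte.

    Hypothetical helper: it only inspects the leading bits and does not
    perform the extra validity checks a real decoder would.
    """
    if lead_byte & 0b1000_0000 == 0b0000_0000:   # 0xxx xxxx
        return 1
    if lead_byte & 0b1110_0000 == 0b1100_0000:   # 110x xxxx
        return 2
    if lead_byte & 0b1111_0000 == 0b1110_0000:   # 1110 xxxx
        return 3
    if lead_byte & 0b1111_1000 == 0b1111_0000:   # 1111 0xxx
        return 4
    raise ValueError(f"not a start byte: {lead_byte:#04x}")


# Compare the prediction from the first byte with the real encoded length.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(repr(ch), sequence_length(encoded[0]), len(encoded))
```

For each character the two printed numbers should agree, since the first byte alone announces the length of the sequence.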

Finally, the bytes that follow those start codes all look like this:

10xx xxxx    A continuation of one of the multi-byte characters
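To see how the `x` bits get filled in, here is a hand-rolled encoder sketch in Python. `encode_code_point` is a made-up name for this example, and the code is only meant to show the bit layout: it does not reject surrogate code points (U+D800 to U+DFFF) the way Python's built-in encoder does. In practice you would just call `str.encode("utf-8")`.

```python
def encode_code_point(cp: int) -> bytes:
    """Pack one code point into UTF-8 bytes using the patterns above.

    Illustrative only: surrogates are not rejected here.
    """
    if cp < 0:
        raise ValueError("code point out of range")
    if cp < 0x80:            # fits in 7 bits: 0xxx xxxx
        return bytes([cp])
    if cp < 0x800:           # fits in 11 bits: 110x xxxx, 10xx xxxx
        return bytes([0b1100_0000 | (cp >> 6),
                      0b1000_0000 | (cp & 0b11_1111)])
    if cp < 0x10000:         # fits in 16 bits: 1110 xxxx plus two continuations
        return bytes([0b1110_0000 | (cp >> 12),
                      0b1000_0000 | ((cp >> 6) & 0b11_1111),
                      0b1000_0000 | (cp & 0b11_1111)])
    if cp < 0x110000:        # up to U+10FFFF: 1111 0xxx plus three continuations
        return bytes([0b1111_0000 | (cp >> 18),
                      0b1000_0000 | ((cp >> 12) & 0b11_1111),
                      0b1000_0000 | ((cp >> 6) & 0b11_1111),
                      0b1000_0000 | (cp & 0b11_1111)])
    raise ValueError("code point out of range")


# Show the bits and check the result against Python's own encoder.
for ch in ["A", "é", "€", "😀"]:
    encoded = encode_code_point(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(repr(ch), " ".join(f"{b:08b}" for b in encoded))
```

The payload bits of the code point are simply split into 6-bit chunks from the right, with whatever is left over going into the start byte.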

Since you can tell what kind of byte you're looking at from its first few bits, even if something gets mangled somewhere you don't lose the whole sequence: a decoder can skip ahead to the next byte that isn't a continuation byte and pick up again from there.
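A small Python sketch of that recovery idea, assuming the damage is a lost start byte. The helper names `is_continuation` and `resync` are invented for this example and are not part of any library.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte matches the 10xx xxxx continuation pattern."""
    return byte & 0b1100_0000 == 0b1000_0000


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that is not a
    continuation byte, i.e. the next place a character can start."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "a€b".encode("utf-8")   # bytes: 61 E2 82 AC 62
damaged = data[2:]             # start byte of the euro sign is gone: 82 AC 62

start = resync(damaged, 0)     # skips the two orphaned continuation bytes
print(damaged[start:].decode("utf-8"))
```

Only the character whose start byte was lost is affected; everything after it still decodes. Python's built-in decoder offers a similar recovery through `bytes.decode("utf-8", errors="replace")`, which substitutes U+FFFD for the undecodable bytes instead of raising.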
