What is the difference between UTF-8 and Unicode?


Question

I have heard conflicting opinions from people - according to the Wikipedia UTF-8 page.

They are the same thing, aren't they? Can someone clarify?

Answer

To expand on the answers others have given:

We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.
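As a quick illustrative sketch (in Python, which is not part of the original answer), a character's code point can be inspected with the built-in `ord`, and recovered from a number with `chr`:

```python
# Unicode assigns every character a unique number, its "code point".
print(hex(ord("A")))    # U+0041
print(hex(ord("€")))    # U+20AC
print(chr(0x20AC))      # from code point back to the character: €
```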

Computers deal with such numbers as bytes... skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an 8-bit byte as the largest numerical unit easily represented on the hardware, 16-bit computers would expand that to two bytes, and so forth.

Old character encodings such as ASCII are from the (pre-) 8-bit era, and try to cram the dominant language in computing at the time, i.e. English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet, both in capital and non-capital form, numbers and punctuation signs, that worked pretty well. ASCII got extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this expansion would be mapped to different characters depending on the language being displayed. The ISO-8859 standards are the most common forms of this mapping; ISO-8859-1 and ISO-8859-15 (also known as ISO-Latin-1, latin1, and yes there are two different versions of the 8859 ISO standard as well).
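The language-dependent mapping of the upper 128 values can be demonstrated in a small Python sketch (illustrative, not from the original answer): the same byte decodes to different characters depending on which ISO-8859 variant is assumed.

```python
# One extended-ASCII byte, two different interpretations.
raw = bytes([0xE9])
print(raw.decode("iso-8859-1"))   # é (Western European, Latin-1)
print(raw.decode("iso-8859-5"))   # a Cyrillic letter (щ)
```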

But that's not enough when you want to represent characters from more than one language, so cramming all available characters into a single byte just won't work.

There are essentially two different types of encodings: one expands the value range by adding more bits. Examples of these encodings would be UCS2 (2 bytes = 16 bits) and UCS4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, as their value range is still limited, even if the limit is vastly higher.
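A short Python sketch of the fixed-width idea, using UTF-32 as a stand-in for UCS4 (the two coincide in practice; this example is illustrative, not part of the original answer):

```python
# A fixed-width encoding spends the same number of bytes on every
# character, no matter how common or rare it is.
for ch in "A", "é", "€":
    print(ch, len(ch.encode("utf-32-le")), "bytes")  # always 4
```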

The other type of encoding uses a variable number of bytes per character, and the most commonly known encodings of this kind are the UTF encodings. All UTF encodings work in roughly the same manner: you choose a unit size, which for UTF-8 is 8 bits, for UTF-16 is 16 bits, and for UTF-32 is 32 bits. The standard then defines some of these bits as flags: if they're set, the next unit in a sequence of units is to be considered part of the same character; if they're not set, this unit represents one character fully. Thus the most common (English) characters occupy only one byte in UTF-8 (two in UTF-16, four in UTF-32), while characters from other languages can occupy up to four bytes in UTF-8 (the original UTF-8 design allowed sequences of up to six bytes, but Unicode has since been capped at U+10FFFF, so four is the practical maximum).
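The variable width is easy to see in Python (an illustrative sketch, not part of the original answer):

```python
# UTF-8 spends between one and four bytes per character;
# UTF-16 spends one or two 16-bit units.
for ch in "A", "é", "€", "😀":
    print(ch,
          len(ch.encode("utf-8")), "byte(s) in UTF-8,",
          len(ch.encode("utf-16-le")) // 2, "unit(s) in UTF-16")
```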

Multi-byte encodings (I should say multi-unit, after the above explanation) have the advantage that they are relatively space-efficient, but the downside that operations such as finding substrings, comparisons, etc. all have to decode the characters to Unicode code points before such operations can be performed (there are some shortcuts, though).
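One concrete consequence, sketched in Python (illustrative, not part of the original answer): slicing the raw bytes of a multi-unit encoding can cut a character's sequence in half.

```python
# Naively slicing UTF-8 bytes can split a multi-byte character.
data = "café".encode("utf-8")      # 5 bytes: the 'é' takes two
try:
    data[:4].decode("utf-8")       # cuts the 'é' sequence in half
except UnicodeDecodeError as exc:
    print("truncated mid-character:", exc.reason)
```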

Both the UCS standards and the UTF standards encode the code points as defined in Unicode. In theory, those encodings could be used to encode any number (within the range the encoding supports) - but of course these encodings were made to encode Unicode code points. And that's the relationship between them.

Windows handles so-called "Unicode" strings as UTF-16 strings, while most UNIXes default to UTF-8 these days. Communications protocols such as HTTP tend to work best with UTF-8, as the unit size in UTF-8 is the same as in ASCII, and most such protocols were designed in the ASCII era. On the other hand, UTF-16 gives the best average space/processing performance when representing all living languages.
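The ASCII compatibility point can be checked directly in Python (an illustrative sketch, not part of the original answer):

```python
# ASCII text is byte-for-byte identical in UTF-8, which is why
# ASCII-era protocols cope with UTF-8 so well. UTF-16, by contrast,
# interleaves NUL bytes into plain ASCII text.
text = "GET / HTTP/1.1"
print(text.encode("ascii") == text.encode("utf-8"))  # identical bytes
print(text.encode("utf-16-le")[:4])                  # NUL bytes appear
```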

The Unicode standard defines fewer code points than can be represented in 32 bits. Thus for all practical purposes, UTF-32 and UCS4 became the same encoding, as you're unlikely to have to deal with multi-unit characters in UTF-32.
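The contrast with UTF-16, which does need multi-unit characters, can be sketched in Python (illustrative, not part of the original answer):

```python
# Code points above U+FFFF need two UTF-16 units (a surrogate pair),
# but every Unicode code point fits in a single UTF-32 unit.
emoji = "\U0001F600"
print(len(emoji.encode("utf-16-le")) // 2)  # 2 units
print(len(emoji.encode("utf-32-le")) // 4)  # 1 unit
```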

Hopefully that fills in some details.
