What is the difference between UTF-8 and Unicode?

Question

I have heard conflicting opinions from people - according to the Wikipedia UTF-8 page.

They are the same thing, aren't they? Can someone clarify?

Answer

To expand on the answers others have given:

We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.
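
For instance (a minimal Python sketch, used here purely for illustration), ord() and chr() convert between a character and its Unicode code point:

    # Every character gets one code point, independent of any byte encoding.
    print(hex(ord('A')))    # 0x41   (LATIN CAPITAL LETTER A)
    print(hex(ord('€')))    # 0x20ac (EURO SIGN)
    print(hex(ord('中')))   # 0x4e2d (a CJK character)
    print(chr(0x20AC))      # '€', round-tripping from code point to character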

Computers deal with such numbers as bytes... skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an 8-bit byte as the largest numerical unit easily represented on the hardware, 16-bit computers would expand that to two bytes, and so forth.

Old character encodings such as ASCII are from the (pre-) 8-bit era, and try to cram the dominant language in computing at the time, i.e. English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet, both in capital and non-capital form, plus numbers and punctuation signs, that worked pretty well. ASCII got extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this expansion would be mapped to different characters depending on the language being displayed. The ISO-8859 standards are the most common forms of this mapping, for example ISO-8859-1 (also known as ISO-Latin-1 or latin1) and ISO-8859-15 (a later revision of it; yes, there are multiple versions of the ISO 8859 standard).
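
To see that ambiguity concretely (a small Python sketch, purely illustrative), the very same byte decodes to different characters depending on which ISO-8859 part you assume:

    # One byte, three meanings: 0xE4 under different ISO-8859 mappings.
    b = b'\xe4'
    print(b.decode('iso-8859-1'))   # 'ä' (Latin-1, Western European)
    print(b.decode('iso-8859-5'))   # 'ф' (Cyrillic)
    print(b.decode('iso-8859-7'))   # 'δ' (Greek)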

But that's not enough when you want to represent characters from more than one language, so cramming all available characters into a single byte just won't work.

There are essentially two different types of encodings: one expands the value range by adding more bits. Examples of these encodings would be UCS2 (2 bytes = 16 bits) and UCS4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, as their value range is still limited, even if the limit is vastly higher.
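
As a quick sketch of that limit (Python, illustrative only): a fixed 16-bit unit can distinguish at most 65,536 values, so a UCS2-style encoding simply cannot reach higher code points:

    # UCS2 spends exactly 16 bits per character, so its range ends at 0xFFFF.
    UCS2_MAX = 0xFFFF
    for ch in ('A', '€', '😀'):
        cp = ord(ch)
        print(f'U+{cp:04X} fits in UCS2: {cp <= UCS2_MAX}')
    # U+0041 fits in UCS2: True
    # U+20AC fits in UCS2: True
    # U+1F600 fits in UCS2: False (emoji lie beyond the 16-bit range)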

The other type of encoding uses a variable number of bytes per character, and the most commonly known encodings for this are the UTF encodings. All UTF encodings work in roughly the same manner: you choose a unit size, which for UTF-8 is 8 bits, for UTF-16 is 16 bits, and for UTF-32 is 32 bits. The standard then defines a few of these bits as flags: if they're set, then the next unit in a sequence of units is to be considered part of the same character. If they're not set, this unit represents one character fully. Thus the most common (English) characters only occupy one byte in UTF-8 (two in UTF-16, four in UTF-32), while characters from other languages can occupy up to four bytes in UTF-8 (the original design allowed sequences of up to six bytes, but UTF-8 as standardized today is capped at four).
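
To make those flag bits concrete (a minimal Python sketch): printing the UTF-8 bytes of a three-unit character shows the 1110 prefix announcing the sequence length and the 10 prefix marking continuation units:

    # '€' (U+20AC) takes three UTF-8 units; the high bits of each byte are flags.
    for byte in '€'.encode('utf-8'):
        print(f'{byte:08b}')
    # 11100010  lead byte: '1110...' announces a 3-byte sequence
    # 10000010  continuation byte: always starts with '10'
    # 10101100  continuation byte: always starts with '10'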

Multi-byte encodings (I should say multi-unit, after the above explanation) have the advantage that they are relatively space-efficient, but the downside that operations such as finding substrings, comparisons, etc. all have to decode the units to Unicode code points before such operations can be performed (there are some shortcuts, though).
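
A short Python illustration of that downside: slicing the raw bytes can cut a character in half, whereas slicing decoded code points cannot:

    text = 'naïve'
    data = text.encode('utf-8')   # b'na\xc3\xafve': 6 bytes for 5 characters
    print(text[:3])               # 'naï', slicing code points is safe
    print(data[:3])               # b'na\xc3', the two-byte 'ï' is cut in half
    try:
        data[:3].decode('utf-8')
    except UnicodeDecodeError as err:
        print('truncated sequence:', err)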

Both the UCS standards and the UTF standards encode the code points as defined in Unicode. In theory, those encodings could be used to encode any number (within the range the encoding supports) - but of course these encodings were made to encode Unicode code points. And that's the relationship between them.
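
That relationship is easy to observe (Python again, illustration only): the code point is fixed by Unicode, while each UTF merely serializes it into bytes differently:

    # One code point (U+20AC), three different byte serializations.
    ch = '€'
    for enc in ('utf-8', 'utf-16-le', 'utf-32-le'):
        print(enc, ch.encode(enc).hex(' '))   # bytes.hex(sep) needs Python 3.8+
    # utf-8     e2 82 ac
    # utf-16-le ac 20
    # utf-32-le ac 20 00 00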

Windows handles so-called "Unicode" strings as UTF-16 strings, while most UNIXes default to UTF-8 these days. Communications protocols such as HTTP tend to work best with UTF-8, as the unit size in UTF-8 is the same as in ASCII, and most such protocols were designed in the ASCII era. On the other hand, UTF-16 gives the best average space/processing performance when representing all living languages.
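
That ASCII compatibility is easy to check (illustrative Python): a pure-ASCII protocol line produces byte-for-byte identical output in ASCII and UTF-8, which is exactly why ASCII-era protocols keep working:

    # Pure ASCII text is unchanged when treated as UTF-8...
    line = 'GET /index.html HTTP/1.1'
    print(line.encode('ascii') == line.encode('utf-8'))   # True
    # ...but not when treated as UTF-16, where every unit is two bytes.
    print(line.encode('utf-16-le')[:8])                   # b'G\x00E\x00T\x00 \x00'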

The Unicode standard defines fewer code points than can be represented in 32 bits. Thus for all practical purposes, UTF-32 and UCS4 became the same encoding, as you're unlikely to have to deal with multi-unit characters in UTF-32.
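
Concretely (one more illustrative Python check): the highest code point Unicode defines is U+10FFFF, which fits comfortably in a single 32-bit unit, so UTF-32 never needs multi-unit sequences:

    # Unicode tops out at U+10FFFF, well inside one 32-bit unit.
    print(0x10FFFF < 2**32)                           # True
    print(len(chr(0x10FFFF).encode('utf-32-le')))     # 4: always exactly one unit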

Hope that fills in some details.
