UTF-8 vs. Unicode


Problem Description

I have heard conflicting opinions from people - according to Wikipedia, see here.

They are the same thing, aren't they? Can someone clarify?

Solution

To expand on the answers others have given:

We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.
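
As a quick Python sketch of that idea: the built-ins `ord()` and `chr()` convert between a character and its Unicode code point.

```python
# Every character maps to a unique Unicode code point (an integer).
for ch in "Aé中😀":
    print(ch, hex(ord(ch)))   # A 0x41, é 0xe9, 中 0x4e2d, 😀 0x1f600

# chr() goes the other way: code point -> character.
assert chr(0x20AC) == "€"     # U+20AC EURO SIGN
```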

Computers deal with such numbers as bytes... skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an 8-bit byte as the largest numerical unit easily represented on the hardware, 16-bit computers would expand that to two bytes, and so forth.

Old character encodings such as ASCII are from the (pre-) 8-bit era, and try to cram the dominant language in computing at the time, i.e. English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet, both in capital and non-capital form, numbers and punctuation signs, that worked pretty well. ASCII got extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this expansion would be mapped to different characters depending on the language being displayed. The ISO-8859 standards are the most common forms of this mapping; ISO-8859-1 and ISO-8859-15 (also known as ISO-Latin-1, latin1, and yes there are two different versions of the 8859 ISO standard as well).
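
A small Python sketch of that mapping difference: the very same byte value decodes to different characters depending on which ISO-8859 variant you assume.

```python
# The byte 0xA4 means different things in different ISO-8859 variants:
b = bytes([0xA4])
print(b.decode("iso-8859-1"))   # '¤' (CURRENCY SIGN)
print(b.decode("iso-8859-15"))  # '€' (EURO SIGN)
```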

But that's not enough when you want to represent characters from more than one language, so cramming all available characters into a single byte just won't work.

There are essentially two different types of encodings: one expands the value range by adding more bits. Examples of these encodings would be UCS2 (2 bytes = 16 bits) and UCS4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, as their value range is still limited, even if the limit is vastly higher.
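
To see why a fixed 16-bit range is not enough, here is a small illustrative sketch (Python does not expose a UCS2 codec directly, so it just checks the code point values): anything above 0xFFFF simply does not fit in two bytes.

```python
# A 2-byte (16-bit) fixed-width encoding like UCS2 tops out at 0xFFFF.
for ch in "A中😀":
    cp = ord(ch)
    print(ch, hex(cp), "fits in UCS2" if cp <= 0xFFFF else "does NOT fit in UCS2")
# 'A' (0x41) and '中' (0x4e2d) fit; '😀' (0x1f600) does not.
```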

The other type of encoding uses a variable number of bytes per character, and the most commonly known encodings for this are the UTF encodings. All UTF encodings work in roughly the same manner: you choose a unit size, which for UTF-8 is 8 bits, for UTF-16 is 16 bits, and for UTF-32 is 32 bits. The standard then defines a few of these bits as flags: if they're set, then the next unit in a sequence of units is to be considered part of the same character. If they're not set, this unit represents one character fully. Thus the most common (English) characters only occupy one byte in UTF-8 (two in UTF-16, four in UTF-32), but characters from other languages can need up to four bytes in UTF-8 as it is defined today (the original design allowed sequences of up to six bytes).
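
A quick way to see the variable unit counts in Python (the `-le` codec variants are used here only to avoid a byte-order mark): the same characters take a different number of units in each UTF encoding.

```python
# Number of bytes each character needs in the three UTF encodings.
for ch in "Aé€😀":
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, 3, 4 bytes
          len(ch.encode("utf-16-le")),  # 2, 2, 2, 4 bytes (1 or 2 16-bit units)
          len(ch.encode("utf-32-le")))  # always 4 bytes (one 32-bit unit)
```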

Multi-byte encodings (I should say multi-unit after the above explanation) have the advantage that they are relatively space-efficient, but the downside that operations such as finding substrings, comparisons, etc. all have to decode the characters to Unicode code points before such operations can be performed (there are some shortcuts, though).
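
This is easy to demonstrate with a minimal Python sketch: slicing the raw bytes can cut a character in half, while slicing the decoded string operates on whole code points.

```python
s = "naïve"
b = s.encode("utf-8")          # 'ï' becomes the two bytes 0xC3 0xAF
print(len(s), len(b))          # 5 characters, 6 bytes

print(s[:3])                   # 'naï' -- slicing code points is safe
try:
    b[:3].decode("utf-8")      # cuts the 'ï' byte sequence in half
except UnicodeDecodeError as e:
    print("byte slice broke a character:", e)
```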

Both the UCS standards and the UTF standards encode the code points as defined in Unicode. In theory, those encodings could be used to encode any number (within the range the encoding supports) - but of course these encodings were made to encode Unicode code points. And that's the relationship between them.
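
To make that relationship concrete, here is a short Python sketch: the code point is the number Unicode assigns, and each UTF encoding is just a different way of writing that number as bytes.

```python
ch = "€"                            # Unicode code point U+20AC
print(hex(ord(ch)))                 # 0x20ac -- the abstract number

print(ch.encode("utf-8").hex())     # 'e282ac'   (3 bytes)
print(ch.encode("utf-16-le").hex()) # 'ac20'     (2 bytes, little-endian)
print(ch.encode("utf-32-le").hex()) # 'ac200000' (4 bytes, little-endian)
```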

Windows handles so-called "Unicode" strings as UTF-16 strings, while most UNIXes default to UTF-8 these days. Communications protocols such as HTTP tend to work best with UTF-8, as the unit size in UTF-8 is the same as in ASCII, and most such protocols were designed in the ASCII era. On the other hand, UTF-16 gives the best average space/processing performance when representing all living languages.
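
The ASCII compatibility mentioned here is easy to check with a tiny Python sketch: any pure-ASCII text produces byte-for-byte the same output in ASCII and in UTF-8, which is why ASCII-era protocols cope with UTF-8 so well.

```python
text = "GET /index.html HTTP/1.1"
assert text.encode("ascii") == text.encode("utf-8")  # identical byte sequences
```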

The Unicode standard defines fewer code points than can be represented in 32 bits. Thus for all practical purposes, UTF-32 and UCS4 became the same encoding, as you're unlikely to have to deal with multi-unit characters in UTF-32.
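
For reference, a small Python sketch: the highest Unicode code point is U+10FFFF, well below what 32 bits can hold, so every character fits in a single UTF-32 unit.

```python
import sys
print(hex(sys.maxunicode))            # 0x10ffff -- the last Unicode code point
print(sys.maxunicode < 2**32)         # True: fits comfortably in 32 bits
print(len("😀".encode("utf-32-le")))  # 4: always exactly one 32-bit unit
```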

Hope that fills in some details.
