What is Unicode, UTF-8, UTF-16?
Question
What's the basis for Unicode and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well but it's not clear to me.
In VSS, when doing a file comparison, there is sometimes a message saying the two files have differing UTFs. Why would this be the case?
Please explain in simple terms.
Recommended answer
Why do we need Unicode?
In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).
But for argument's sake, let's say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but it is not fine for Joe the software developer. Approximately half the world uses non-Latin characters, and using ASCII is arguably inconsiderate to these people. On top of that, he is closing off his software to a large and growing economy.
Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence the first 128 are also identical to ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.
So how many bytes give access to what characters in these encodings?
- UTF-8:
  - 1 byte: Standard ASCII
  - 2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
  - 3 bytes: BMP
  - 4 bytes: All Unicode characters
- UTF-16:
  - 2 bytes: BMP
  - 4 bytes: All Unicode characters
It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese/Japanese/Korean (CJK) characters.
If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, UTF-8 could use up to 1.5 times as much memory as UTF-16 (three bytes per BMP character instead of two). When dealing with large amounts of text, such as large web pages or lengthy Word documents, this could impact performance.
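The byte counts listed above can be checked directly. The following is a minimal Python sketch; the sample characters and the helper name `encoded_sizes` are illustrative choices, not part of the original answer.

```python
# Bytes needed per character in each encoding.
# "utf-16-le" is used so that no byte order mark is added to the count.
samples = {
    "A": "Latin letter (ASCII)",
    "\u00e9": "Latin letter e with acute accent",
    "\u0639": "Arabic letter ain",
    "\u10d0": "Georgian letter an",
    "\u6c49": "CJK ideograph in the BMP",
    "\U0001d11e": "musical symbol outside the BMP",
}


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch, description in samples.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {description}: UTF-8 {utf8_len}, UTF-16 {utf16_len}")
```

The CJK ideograph is the case behind the 1.5 figure: three bytes in UTF-8 against two in UTF-16.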
Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.
- UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2-4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each of these bytes is 1 to avoid clashing with the ASCII characters.
- UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions maps to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream which indicates endianness. Thus, if you are reading UTF-16 input, and no endianness is specified, you must check for this.
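To make the bit layout concrete, here is a rough Python sketch of both encodings done by hand. The function names `utf8_bytes` and `utf16_code_units` are hypothetical helpers written for illustration, and the sketch assumes a valid code point (it does not reject the surrogate range or values above U+10FFFF, which a real encoder must do).

```python
import codecs


def utf8_bytes(code_point):
    """Encode one code point as UTF-8 (illustrative, no validation)."""
    if code_point < 0x80:
        return bytes([code_point])                     # 0xxxxxxx: plain ASCII
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),        # 110xxxxx lead byte
                      0x80 | (code_point & 0x3F)])     # 10xxxxxx continuation
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),       # 1110xxxx lead byte
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),           # 11110xxx lead byte
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def utf16_code_units(code_point):
    """Encode one code point as a list of 16-bit UTF-16 code units."""
    if code_point < 0x10000:
        return [code_point]                # BMP: the code unit is the code point
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return [high, low]


# U+1D11E (musical symbol G clef) lies outside the BMP.
print(utf8_bytes(0x1D11E).hex())
print([hex(unit) for unit in utf16_code_units(0x1D11E)])

# The same text in both byte orders, each prefixed with its byte order mark.
little_endian = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
big_endian = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
# Python's generic "utf-16" decoder reads the mark to pick the byte order.
print(little_endian.decode("utf-16"), big_endian.decode("utf-16"))
```

Every byte of a multi-byte UTF-8 sequence has its high bit set, which is why none of them can be mistaken for an ASCII character.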
As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.
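As a small illustration of why the encoding must be known, the Python sketch below decodes UTF-8 bytes as if they were UTF-16. The sample text and variable names are arbitrary choices for the example.

```python
original = "\u6c49\u8bed"                 # two CJK characters, "汉语"
utf8_data = original.encode("utf-8")      # 6 bytes: 3 per character

# Decoding with the right encoding round-trips.
roundtrip = utf8_data.decode("utf-8")

# Decoding with the wrong one may "succeed" and silently yield other text:
# the 6 bytes are read as 3 unrelated 16-bit code units.
misread = utf8_data.decode("utf-16-le")

# Or it may fail outright, e.g. when the byte count is odd.
try:
    "\u6c49".encode("utf-8").decode("utf-16-le")
    odd_length_failed = False
except UnicodeDecodeError:
    odd_length_failed = True

print(roundtrip == original, misread == original, odd_length_failed)
```

The silent case is the more dangerous one, since no error signals that the text is wrong.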
Character and String data types: How are they encoded in the programming language? If they are raw bytes, the minute you try to output non-ASCII characters, you may run into a few problems. Also, even if the character type is based on a UTF, that doesn't mean the strings are proper UTF. They may allow byte sequences that are illegal. Generally, you'll have to use a library that supports UTF, such as ICU for C, C++ and Java. In any case, if you want to input/output something other than the default encoding, you will have to convert it first.
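Python can serve as one example of a string type that permits content which is not proper UTF. The sketch below uses two hypothetical helpers, `is_well_formed` and `is_valid_utf8`, to show a string holding a lone surrogate and a byte sequence that is not legal UTF-8.

```python
def is_well_formed(text):
    """True if the string can be encoded as UTF-8 (no lone surrogates)."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def is_valid_utf8(data):
    """True if the bytes form a legal UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


lone_surrogate = "\ud800"        # allowed in a Python str, illegal in any UTF
bad_bytes = b"\xc3\x28"          # lead byte followed by a non-continuation byte

print(is_well_formed("abc"), is_well_formed(lone_surrogate))
print(is_valid_utf8(b"\xc3\xa9"), is_valid_utf8(bad_bytes))

# With errors="replace", the offending byte becomes U+FFFD instead of raising.
repaired = bad_bytes.decode("utf-8", errors="replace")
print(repaired)
```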
Recommended/default/dominant encodings: When given a choice of which UTF to use, it is usually best to follow recommended standards for the environment you are working in. For example, UTF-8 is dominant on the web, and since HTML5, it has been the recommended encoding. Conversely, both .NET and Java environments are founded on a UTF-16 character type. Confusingly (and incorrectly), references are often made to the "Unicode encoding", which usually refers to the dominant UTF encoding in a given environment.
Library support: The libraries you are using support some kind of encoding. Which one? Do they support the corner cases? Since necessity is the mother of invention, UTF-8 libraries will generally support 4-byte characters properly, since 1, 2, and even 3 byte characters can occur frequently. However, not all purported UTF-16 libraries support surrogate pairs properly since they occur very rarely.
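One way to see what incomplete surrogate support looks like is to treat UTF-16 code units as if they were characters. This Python sketch (the helper `utf16_length` is a hypothetical name) counts code units and then cuts a surrogate pair in half, which is the kind of bug such libraries tend to have.

```python
def utf16_length(text):
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


clef = "\U0001d11e"                      # one code point outside the BMP
print(len(clef), utf16_length(clef))     # code points vs UTF-16 code units

# Truncating after one code unit keeps only the high surrogate,
# which is not valid UTF-16 on its own.
first_unit_only = clef.encode("utf-16-le")[:2]
try:
    first_unit_only.decode("utf-16-le")
    split_pair_failed = False
except UnicodeDecodeError:
    split_pair_failed = True

print(split_pair_failed)
```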
Counting characters: There exist combining characters in Unicode. For example, the code points U+006E (n) and U+0303 (a combining tilde) together form ñ, but the single code point U+00F1 also forms ñ. They should look identical, but a simple counting algorithm will return 2 for the first example and 1 for the latter. This isn't necessarily wrong, but may not be the desired outcome either.
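The two forms of ñ can be compared in Python with the standard `unicodedata` module. This is a minimal sketch; the helper `count_base_characters` is a hypothetical name, and skipping combining marks is only an approximation of user-perceived characters, not full grapheme cluster segmentation.

```python
import unicodedata

decomposed = "n\u0303"     # U+006E followed by U+0303 (combining tilde)
precomposed = "\u00f1"     # single code point U+00F1

print(len(decomposed), len(precomposed))     # naive code point counts

# Normalizing to a common form makes the two compare equal.
nfc = unicodedata.normalize("NFC", decomposed)     # composes to U+00F1
nfd = unicodedata.normalize("NFD", precomposed)    # decomposes to n + U+0303
print(nfc == precomposed, nfd == decomposed)


def count_base_characters(text):
    """Count code points that are not combining marks (an approximation)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(count_base_characters(decomposed), count_base_characters(precomposed))
```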
Comparing for equality: A, А, and Α look the same, but they're Latin, Cyrillic, and Greek respectively. You also have cases like C and Ⅽ, one is a letter, the other a Roman numeral. In addition, we have the combining characters to consider as well. For more info see Duplicate characters in Unicode.
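A short Python sketch of these look-alikes, again using the standard `unicodedata` module. Compatibility normalization (NFKC) is shown here as one possible tool, not as a complete answer to look-alike detection: it folds the Roman numeral into the Latin letter but leaves the Latin, Cyrillic, and Greek letters distinct.

```python
import unicodedata

latin_a = "A"             # U+0041
cyrillic_a = "\u0410"     # U+0410
greek_alpha = "\u0391"    # U+0391

for ch in (latin_a, cyrillic_a, greek_alpha):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# They render alike but are three different code points.
all_distinct = len({latin_a, cyrillic_a, greek_alpha}) == 3

# The Roman numeral has a compatibility mapping to the Latin letter C.
roman_hundred = "\u216d"
folded = unicodedata.normalize("NFKC", roman_hundred)

# NFKC does not merge letters from different scripts.
still_distinct = unicodedata.normalize("NFKC", cyrillic_a) != latin_a

print(all_distinct, folded == "C", still_distinct)
```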
Surrogate pairs: These come up often enough on SO, so I'll just provide some example links:
Others?: