What is Unicode, UTF-8, UTF-16?


Problem description

What's the basis for Unicode, and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well, but it's not clear to me.

In VSS, when doing a file comparison, there is sometimes a message saying the two files have differing UTFs. Why would this be the case?

Please explain in simple terms.

Solution


Why do we need Unicode?

In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).

But for argument's sake, let's say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but it is not fine for Joe the software developer. Approximately half the world uses non-Latin characters, and using ASCII is arguably inconsiderate to these people; on top of that, he is closing off his software to a large and growing economy.

Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence also ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.

Memory considerations

So how many bytes give access to what characters in these encodings?

  • UTF-8:
    • 1 byte: Standard ASCII
    • 2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
    • 3 bytes: BMP
    • 4 bytes: All Unicode characters
  • UTF-16:
    • 2 bytes: BMP
    • 4 bytes: All Unicode characters

It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese/Japanese/Korean (CJK) characters (see http://www.unicode.org/faq/han_cjk.html).

If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, using UTF-8 could be up to 1.5 times less memory efficient than UTF-16. When dealing with large amounts of text, such as large web-pages or lengthy word documents, this could impact performance.
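
To see these numbers concretely, here is a minimal Java sketch (the class name and sample strings are my own choices for illustration, not from the original answer) that encodes a few strings both ways and counts the bytes:

    import java.nio.charset.StandardCharsets;

    public class EncodingSizes {
        public static void main(String[] args) {
            // ASCII, a European script, a CJK sample, and a character outside the BMP.
            String[] samples = { "hello", "ελληνικά", "汉语", new String(Character.toChars(0x1F600)) };
            for (String s : samples) {
                int utf8 = s.getBytes(StandardCharsets.UTF_8).length;
                int utf16 = s.getBytes(StandardCharsets.UTF_16BE).length; // BE variant: no BOM prepended
                System.out.printf("%-10s UTF-8: %2d bytes, UTF-16: %2d bytes%n", s, utf8, utf16);
            }
        }
    }

Running it shows "hello" at half the size in UTF-8 (5 vs 10 bytes), the Greek sample breaking even (16 vs 16), and the CJK sample 1.5 times larger in UTF-8 (6 vs 4 bytes), matching the figures above.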

Encoding basics

Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.

  • UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2-4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each byte is 1 to avoid clashing with the ASCII characters.
  • UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions maps to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream to indicate endianness. Thus, if you are reading UTF-16 input and no endianness is specified, you must check for this. (Both mechanisms are shown in the sketch after this list.)
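
Here is a small Java sketch of both mechanisms, using the arbitrary non-BMP code point U+1F600 as an example: it splits the code point into its UTF-16 surrogate pair and prints the UTF-8 bytes with their marker bits visible:

    import java.nio.charset.StandardCharsets;

    public class EncodingInternals {
        public static void main(String[] args) {
            int cp = 0x1F600; // an emoticon; any non-BMP code point works
            char[] units = Character.toChars(cp); // UTF-16 encodes it as two surrogate code units
            System.out.printf("U+%X -> high surrogate U+%04X, low surrogate U+%04X%n",
                    cp, (int) units[0], (int) units[1]);

            // In UTF-8 the lead byte starts with 11110 (a 4-byte sequence) and each
            // continuation byte starts with 10.
            for (byte b : new String(units).getBytes(StandardCharsets.UTF_8)) {
                System.out.print(String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0') + " ");
            }
            System.out.println(); // 11110000 10011111 10011000 10000000
        }
    }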

As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.
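
A quick way to convince yourself of this incompatibility in Java (a toy example, not from the original answer): encode a string as UTF-16 and then decode the same bytes as UTF-8:

    import java.nio.charset.StandardCharsets;

    public class MixedUpEncodings {
        public static void main(String[] args) {
            byte[] bytes = "hi".getBytes(StandardCharsets.UTF_16); // BOM plus two 2-byte code units
            System.out.println(new String(bytes, StandardCharsets.UTF_16)); // "hi"
            // The BOM bytes 0xFE 0xFF are illegal in UTF-8, and every other
            // code unit carries a stray 0x00, so this prints garbage:
            System.out.println(new String(bytes, StandardCharsets.UTF_8));
        }
    }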

Practical programming considerations

Character and String data types: How are they encoded in the programming language? If they are raw bytes, the minute you try to output non-ASCII characters, you may run into a few problems. Also, even if the character type is based on a UTF, that doesn't mean the strings are proper UTF. They may allow byte sequences that are illegal. Generally, you'll have to use a library that supports UTF, such as ICU for C, C++ and Java. In any case, if you want to input/output something other than the default encoding, you will have to convert it first.
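
For instance, Java's built-in decoder (used here instead of ICU just to keep the sketch self-contained) silently substitutes U+FFFD for illegal byte sequences unless you explicitly ask it to report them:

    import java.nio.ByteBuffer;
    import java.nio.charset.*;

    public class StrictDecoding {
        public static void main(String[] args) {
            // 0xC3 opens a 2-byte UTF-8 sequence, but 0x28 ('(') is not a valid continuation byte.
            byte[] bad = { (byte) 0xC3, (byte) 0x28 };
            System.out.println(new String(bad, StandardCharsets.UTF_8)); // lenient: replacement char plus '('
            CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                strict.decode(ByteBuffer.wrap(bad));
            } catch (CharacterCodingException e) {
                System.out.println("Illegal UTF-8 input: " + e);
            }
        }
    }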

Recommended/default/dominant encodings: When given a choice of which UTF to use, it is usually best to follow recommended standards for the environment you are working in. For example, UTF-8 is dominant on the web, and since HTML5, it has been the recommended encoding. Conversely, both .NET and Java environments are founded on a UTF-16 character type. Confusingly (and incorrectly), references are often made to the "Unicode encoding", which usually refers to the dominant UTF encoding in a given environment.

Library support: What encodings do the libraries you are using support? Do they support the corner cases? Since necessity is the mother of invention, UTF-8 libraries will generally support 4-byte characters properly, since 1-, 2-, and even 3-byte characters occur frequently. However, not all purported UTF-16 libraries support surrogate pairs properly, since those occur only rarely.
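
Java itself is a handy demonstration of this corner case: String works in UTF-16 code units, so naive length and slicing code quietly breaks on surrogate pairs (a toy sketch; any non-BMP character triggers it):

    public class SurrogateCornerCase {
        public static void main(String[] args) {
            String s = "a" + new String(Character.toChars(0x1F600)) + "b"; // 'a', a non-BMP char, 'b'
            System.out.println(s.length());                      // 4 code units
            System.out.println(s.codePointCount(0, s.length())); // 3 actual characters
            // Naive slicing can cut the pair in half, leaving an unpaired high surrogate:
            System.out.println(s.substring(0, 2));
        }
    }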

Counting characters: There exist combining characters in Unicode. For example, the code points U+006E (n) and U+0303 (a combining tilde) together form ñ, but the single code point U+00F1 also forms ñ. They should look identical, but a simple counting algorithm will return 2 for the first example and 1 for the latter. This isn't necessarily wrong, but it may not be the desired outcome either.
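
In Java the two forms compare unequal until you normalize them, e.g. with java.text.Normalizer (NFC composes sequences into single code points where possible):

    import java.text.Normalizer;

    public class CombiningCharacters {
        public static void main(String[] args) {
            String decomposed = "n\u0303"; // U+006E followed by U+0303 (combining tilde)
            String composed = "\u00F1";    // U+00F1, precomposed ñ
            System.out.println(decomposed.length());         // 2
            System.out.println(composed.length());           // 1
            System.out.println(decomposed.equals(composed)); // false, despite identical rendering
            System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                    .equals(composed));                      // true
        }
    }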

Comparing for equality: A, А, and Α look the same, but they're Latin, Cyrillic, and Greek respectively. You also have cases like C and Ⅽ, one is a letter, the other a Roman numeral. In addition, we have the combining characters to consider as well. For more info see Duplicate characters in Unicode.
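
Printing the code points reveals the difference that the glyphs hide (a small sketch using the same characters as above):

    public class LookAlikes {
        public static void main(String[] args) {
            // Latin A, Cyrillic А, Greek Α, Latin letter C, Roman numeral Ⅽ.
            for (String s : new String[] { "\u0041", "\u0410", "\u0391", "\u0043", "\u216D" }) {
                System.out.printf("%s = U+%04X%n", s, s.codePointAt(0));
            }
        }
    }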

Surrogate pairs: These come up often enough on SO, so I'll just provide some example links:

Others?:
