Confusion on Unicode and Multibyte Articles

Question

Referring to Joel's article:

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.

After reading the whole article, my understanding is that if someone tells you their text is in Unicode, you have no idea how much memory each of their characters takes up. They have to tell you, "My Unicode text is encoded in UTF-8"; only then can you know how much memory each of their characters takes up.

Unicode = not necessarily 2 bytes per character

However, when it comes to Code Project's article and Microsoft's Help, I get confused:

Microsoft:

Unicode is a 16-bit character encoding, providing enough encodings for all languages. All ASCII characters are included in Unicode as "widened" characters.

Code Project:

The Unicode character set is a "wide character" (2 bytes per character) set that contains every character available in every language, including all technical symbols and special publishing characters. Multibyte character set (MBCS) uses either 1 or 2 bytes per character

Unicode = 2 bytes per character?

Can 65,536 possible characters represent all the languages in the world?

Why does the concept seem to differ between the web developer community and the desktop developer community?

Recommended answer

Once upon a time,

  • Unicode had only as many characters as would fit in 16 bits, and
  • UTF-8 did not exist, or was not the de facto encoding to use.

These factors led to UTF-16 (or rather, what is now called UCS-2) being considered synonymous with "Unicode", because it was, after all, the encoding that supported all of Unicode.

Practically, you will see "Unicode" being used where "UTF-16" or "UCS-2" is meant. This is a historical confusion and should be ignored and not propagated. Unicode is a set of characters; UTF-8, UTF-16, and UCS-2 are different encodings.
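
The split between the character set and its encodings can be illustrated with a small Python sketch. The helper name `bytes_per_char` and the sample characters are chosen here purely for illustration and do not come from the original answer. The little-endian codec names are used so that Python does not prepend a byte order mark to the output.

```python
def bytes_per_char(ch):
    """Return how many bytes one character occupies in each encoding."""
    # The "-le" variants avoid the BOM that plain "utf-16"/"utf-32" would add.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}


# Illustrative samples: ASCII, Latin-1 range, rest of the BMP, above the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", bytes_per_char(ch))
```

The same code point takes a different number of bytes depending on the encoding, which is why "my text is Unicode" alone says nothing about its size in memory.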

(The difference between UTF-16 and UCS-2 is that UCS-2 is a true 16-bits-per-"character" encoding, and therefore encodes only the "BMP" (Basic Multilingual Plane) portion of Unicode, whereas UTF-16 uses "surrogate pairs" (for a total of 32 bits) to encode above-BMP characters.)
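
As a rough sketch of how a surrogate pair is formed, the following Python computes the two 16-bit units for a code point above the BMP, using the standard UTF-16 arithmetic. The function name `to_surrogate_pair` is hypothetical, and the chosen code point (U+1F600) is only an example; the result is cross-checked against Python's built-in `utf-16-be` codec.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a code point above the BMP into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
# Pack the pair as two big-endian 16-bit units and compare with the codec.
manual = struct.pack(">HH", high, low)
builtin = "\U0001F600".encode("utf-16-be")
print(hex(high), hex(low), manual == builtin)
```

UCS-2 has no such mechanism, so a code point like this one simply cannot be represented in it.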
