Confusion on Unicode and Multibyte Articles

Question

By referring to Joel's article:

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.

After reading the whole article, my understanding is that if someone tells you his text is in Unicode, you have no idea how much memory each of his characters takes up. He has to tell you, "My Unicode text is encoded in UTF-8"; only then will you know how much memory each of his characters takes up.
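To make that point concrete, here is a minimal Python sketch that encodes the same characters in three Unicode encodings and counts the bytes. The helper name `encoded_sizes` and the sample characters are chosen only for illustration; they do not come from Joel's article.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character needs in each encoding."""
    # The "-le" variants are used so that no byte order mark is added to the count.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# Sample characters written as escapes: Latin A, e with acute accent,
# a CJK ideograph, and an emoji that lies outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The byte count differs per encoding, and in UTF-8 and UTF-16 it also differs per character, so knowing that text is "Unicode" says nothing about its size until the encoding is named.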

Unicode = not necessarily 2 bytes per character

However, when it comes to Code Project's article and Microsoft's Help, I get confused:

Microsoft:

Unicode is a 16-bit character encoding, providing enough encodings for all languages. All ASCII characters are included in Unicode as "widened" characters.


Code Project:

The Unicode character set is a "wide character" (2 bytes per character) set that contains every character available in every language, including all technical symbols and special publishing characters. Multibyte character set (MBCS) uses either 1 or 2 bytes per character

Unicode = 2 bytes per character?

Can 65,536 possible characters represent all the languages in this world?

Why does the concept seem to differ between the web developer community and the desktop developer community?

Recommended answer

Once upon a time,

  • Unicode had only as many characters as fit in 16 bits, and
  • UTF-8 did not exist or was not the de facto encoding to use.

These factors led to UTF-16 (or rather, what is now called UCS-2) being considered synonymous with "Unicode", because it was, after all, the encoding that supported all of Unicode.

Practically, you will see "Unicode" being used where "UTF-16" or "UCS-2" is meant. This is a historical confusion and should be ignored and not propagated. Unicode is a set of characters; UTF-8, UTF-16, and UCS-2 are different encodings.

(The difference between UTF-16 and UCS-2 is that UCS-2 is a true 16-bits-per-"character" encoding, and therefore encodes only the "BMP" (Basic Multilingual Plane) portion of Unicode, whereas UTF-16 uses "surrogate pairs" (for a total of 32 bits) to encode above-BMP characters.)
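As a rough illustration of the surrogate-pair mechanism, the sketch below splits a code point above the BMP into the two 16-bit units that UTF-16 uses. The function name `to_surrogate_pair` is hypothetical; the arithmetic follows the standard UTF-16 scheme (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder (big-endian, no byte order mark).
print("\U0001F600".encode("utf-16-be").hex())
```

UCS-2 has no such mechanism, so a code point like U+1F600 simply cannot be represented in it, while UTF-16 spends two 16-bit units on it.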
