Why is there no UTF-24?

Question

Possible Duplicate:
Why UTF-32 exists whereas only 21 bits are necessary to encode every character?

The maximum Unicode code point is 0x10FFFF in UTF-32. UTF-32 has 21 information bits and 11 superfluous blank bits. So why is there no UTF-24 encoding (i.e. UTF-32 with the high byte removed) for storing each code point in 3 bytes rather than 4?
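To make the idea concrete, here is a minimal Python sketch of what "UTF-32 with the high byte removed" could look like. UTF-24 is not a standard encoding, so the function names `utf24_encode` and `utf24_decode`, the big-endian byte order, and the decision to reject surrogates are all assumptions made for illustration.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point (fits in 21 bits)


def utf24_encode(text: str) -> bytes:
    """Encode each code point as 3 big-endian bytes (hypothetical format)."""
    out = bytearray()
    for ch in text:
        out += ord(ch).to_bytes(3, "big")
    return bytes(out)


def utf24_decode(data: bytes) -> str:
    """Decode 3-byte big-endian units back into a string (hypothetical format)."""
    if len(data) % 3 != 0:
        raise ValueError("length is not a multiple of 3")
    chars = []
    for i in range(0, len(data), 3):
        # Three bytes must be reassembled by hand; there is no native 24-bit load.
        cp = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2]
        if cp > MAX_CODE_POINT:
            raise ValueError(f"value {cp:#x} is above U+10FFFF")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"value {cp:#x} is a surrogate, not a scalar value")
        chars.append(chr(cp))
    return "".join(chars)
```

For example, `utf24_encode("A")` gives `b"\x00\x00A"`, and the Gothic letter U+10330 becomes `b"\x01\x03\x30"`.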

Recommended answer

Well, the truth is that UTF-24 was suggested in 2007:

http://unicode.org/mail-arch/unicode-ml/y2007-m01/0057.html

The pros and cons mentioned there:

"UTF-24 
Advantages: 
 1. Fixed length code units. 
 2. Encoding format is easily detectable for any content, even if mislabeled. 
 3. Byte order can be reliably detected without the use of BOM, even for single-code-unit data. 
 4. If octets are dropped / inserted, decoder can resync at next valid code unit. 
 5. Practical for both internal processing and storage / interchange. 
 6. Conversion to code point scalar values is more trivial than for UTF-16 surrogate pairs 
    and UTF-7/8 multibyte sequences. 
 7. 7-bit transparent version can be easily derived. 
 8. Most compact for texts in archaic scripts. 
Disadvantages: 
 1. Takes more space than UTF-8/16, except for texts in archaic scripts. 
 2. Comparing to UTF-32, extra bitwise operations required to convert to code point scalar values. 
 3. Incompatible with many legacy text-processing tools and protocols. "

As pointed out by David Starner in http://www.mail-archive.com/unicode@unicode.org/msg16011.html:

Why? UTF-24 will almost invariably be larger than UTF-16, unless you are talking about a document in Old Italic or Gothic. The math alphanumeric characters will almost always be combined with enough ASCII to make UTF-8 a win, and if not, enough BMP characters to make UTF-16 a win. Modern computers don't deal with 24-bit chunks well; in memory, they'd take up 32 bits apiece, unless you declared them packed, and then they'd be a lot slower than UTF-16 or UTF-32. And if you're storing to disk, you may as well use BOCU or SCSU (you're already going non-standard), or use standard compression with UTF-8, UTF-16, BOCU or SCSU. SCSU or BOCU compressed should take up half the space of UTF-24, if that.
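The size argument can be checked with a short Python sketch. The helper name `encoded_sizes` and the sample strings are assumptions chosen for illustration, and the "UTF-24" figure is simply 3 bytes per code point, since no such codec exists in Python. BOM-free little-endian codecs are used so the counts reflect only the text itself.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte count of `text` in each encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-24": 3 * len(text),  # hypothetical: 3 bytes per code point
        "UTF-32": len(text.encode("utf-32-le")),
    }


samples = {
    "ASCII": "Hello, world",                       # 12 ASCII characters
    "Greek (BMP)": "\u03b1\u03b2\u03b3",           # 3 BMP characters
    "Gothic": "\U00010330\U00010331\U00010332",    # 3 supplementary-plane characters
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
# ASCII {'UTF-8': 12, 'UTF-16': 24, 'UTF-24': 36, 'UTF-32': 48}
# Greek (BMP) {'UTF-8': 6, 'UTF-16': 6, 'UTF-24': 9, 'UTF-32': 12}
# Gothic {'UTF-8': 12, 'UTF-16': 12, 'UTF-24': 9, 'UTF-32': 12}
```

Only the all-Gothic sample comes out smaller in the 3-byte form, which matches the point made in the quote: text that mixes in ASCII or other BMP characters favors UTF-8 or UTF-16.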

You could also check the following StackOverflow post:

Why UTF-32 exists whereas only 21 bits are necessary to encode every character?
