Can Unicode NFC normalization increase the length of a string?

Question

If I apply Unicode Normalization Form C to a string, will the number of code points in the string ever increase?

Recommended answer

Yes, there are code points that expand to multiple code points after applying NFC normalization. Within the Basic Multilingual Plane, for example, there are 70 code points that expand to 2 code points after applying NFC normalization, and there are 2 code points (U+FB2C and U+FB2D within the Alphabetic Presentation Forms block) that expand to 3 code points.
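As a quick illustration, here is a minimal Python sketch using the standard `unicodedata` module (an assumption of this example; the answer itself relies on Java and `unichars`). The helper name `nfc_code_points` and the sample U+0958 are picked here for illustration; U+FB2C and U+FB2D are the two code points named above.

```python
import unicodedata

def nfc_code_points(ch):
    """Return the code points of NFC(ch) as a list of 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", ch)]

# U+0958 is an illustrative 2-code-point case; U+FB2C and U+FB2D
# are the 3-code-point cases mentioned in the answer.
for sample in ("\u0958", "\uFB2C", "\uFB2D"):
    print(f"U+{ord(sample):04X} -> {' '.join(nfc_code_points(sample))}")
```

These code points grow under NFC because their canonical decompositions are excluded from recomposition, so the composition step cannot fold them back into a single code point.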

One guarantee that you have for this so-called "expansion factor" is that no string will ever expand more than 3 times in length (in terms of number of code units) after NFC normalization is applied:

There is also a Unicode Consortium stability policy that canonical mappings are always limited in all versions of Unicode, so that no string when decomposed with NFC expands to more than 3× in length (measured in code units). This is true whether the text is in UTF-8, UTF-16, or UTF-32. This guarantee also allows for certain optimizations in processing, especially in determining buffer sizes.

Section 9, Detecting Normalization Forms. UAX #15: Unicode Normalization Forms.
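The buffer-sizing use of this guarantee can be sketched as follows. This is an illustration only, assuming Python's `unicodedata` module; the helper names `code_units`, `max_nfc_units`, and `expansion` are hypothetical and not part of any library.

```python
import unicodedata

# Codec name and bytes per code unit for each encoding form.
_FORMS = {"utf-8": ("utf-8", 1), "utf-16": ("utf-16-le", 2), "utf-32": ("utf-32-le", 4)}

def code_units(text, encoding):
    """Count the code units of text in 'utf-8', 'utf-16', or 'utf-32'."""
    codec, width = _FORMS[encoding]
    return len(text.encode(codec)) // width

def max_nfc_units(input_units):
    """Worst-case NFC output size in code units, per the 3x guarantee."""
    return 3 * input_units

def expansion(text, encoding):
    """Return (code units before NFC, code units after NFC)."""
    normalized = unicodedata.normalize("NFC", text)
    return code_units(text, encoding), code_units(normalized, encoding)

for encoding in _FORMS:
    before, after = expansion("\uFB2C", encoding)
    print(f"{encoding}: {before} -> {after} (bound {max_nfc_units(before)})")
```

A caller that must preallocate an output buffer can reserve `max_nfc_units(n)` code units for an input of `n` code units and never need to grow it during NFC normalization.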

I have written a Java program to determine which code points within a Unicode block expand to multiple code points: http://ideone.com/9PUOCb
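For readers without a Java toolchain, the same kind of block scan can be sketched in Python. This is not the linked program, only a rough equivalent assuming the standard `unicodedata` module; the function name `expanding_code_points` is hypothetical, and the results depend on the Unicode version bundled with the Python build.

```python
import unicodedata

def expanding_code_points(start, end):
    """Map each code point in [start, end] whose NFC form is longer than
    one code point to the length of that NFC form."""
    result = {}
    for cp in range(start, end + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates; they are not Unicode scalar values
        n = len(unicodedata.normalize("NFC", chr(cp)))
        if n > 1:
            result[cp] = n
    return result

# Alphabetic Presentation Forms block: U+FB00..U+FB4F
block = expanding_code_points(0xFB00, 0xFB4F)
for cp, n in sorted(block.items()):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')} -> {n}")
```

Passing `0x0000, 0xFFFF` scans the whole BMP, and `0x0000, 0x10FFFF` scans all planes.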

Alternatively, Tom Christiansen's unichars utility, part of the Unicode::Tussle CPAN module, can be used. (Note: Mac users may see an error at the make test installation step saying that the Perl version is too old. If you see this error, you can install the module by running notest install Unicode::Tussle within a CPAN shell.)

Examples:

  • Print the code points in the BMP that expand to 3 code points:

    unichars 'length(NFC) == 3'

     שּׁ  U+FB2C HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT
     שּׂ  U+FB2D HEBREW LETTER SHIN WITH DAGESH AND SIN DOT

  • Count the number of code points in all planes that expand to more than one code point:

    unichars -a 'length(NFC) > 1' | wc -l

          85

  • See also the FAQ: What are the maximum expansion factors for the different normalization forms?
