What is normalized UTF-8 all about?

Question

The ICU project (which now also has a PHP library) contains the classes needed to normalize UTF-8 strings, which makes it easier to compare values when searching.

However, I'm trying to figure out what this means for applications. For example, in which cases do I want "Canonical Equivalence" instead of "Compatibility Equivalence", or vice versa?

Recommended answer

Everything You Never Wanted to Know about Unicode Normalization

Canonical Normalization

Unicode includes multiple ways to encode some characters, most notably accented characters. Canonical normalization converts the code points into a canonical encoding form. The resulting code points should appear identical to the original ones, barring any bugs in the fonts or rendering engine.
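As a minimal sketch of the point above, the following uses Python's standard `unicodedata` module (a stand-in chosen for illustration; the question itself is about ICU and PHP). The variable names are hypothetical. It shows one accented character encoded two ways, which compare unequal until both are canonically normalized.

```python
import unicodedata

# Two encodings of the same visible character "é".
composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT (two code points)

# A raw comparison sees different code point sequences.
raw_equal = composed == decomposed

# After canonical normalization both strings have the same code points.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(raw_equal, nfc_equal)
```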

Because the results appear identical, it is always safe to apply canonical normalization to a string before storing or displaying it, as long as you can tolerate the result not being bit-for-bit identical to the input.

Canonical normalization comes in two forms: NFD and NFC. The two are equivalent in the sense that one can convert between them without loss. Comparing two strings under NFC always gives the same result as comparing them under NFD.

NFD has the characters fully expanded (decomposed). This is the faster normalization form to compute, but it results in more code points (i.e. it uses more space).

If you just want to compare two strings that are not already normalized, NFD is the preferred normalization form, unless you know you need compatibility normalization.

NFC recombines code points where possible after running the NFD algorithm. This takes a little longer, but results in shorter strings.
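The space difference between the two canonical forms can be seen with a short sketch, again using Python's `unicodedata` as an assumed stand-in for ICU. The sample word and variable names are illustrative only.

```python
import unicodedata

text = "caf\u00e9"  # "café" with a precomposed é

nfd = unicodedata.normalize("NFD", text)  # é expanded to "e" + combining accent
nfc = unicodedata.normalize("NFC", nfd)   # recombined into a single code point

# Code point counts and UTF-8 byte sizes for each form.
nfd_len, nfc_len = len(nfd), len(nfc)
nfd_bytes, nfc_bytes = len(nfd.encode("utf-8")), len(nfc.encode("utf-8"))

# Converting between the two forms loses nothing.
round_trip_ok = unicodedata.normalize("NFD", nfc) == nfd

print(nfd_len, nfc_len, nfd_bytes, nfc_bytes, round_trip_ok)
```

For this input, NFD yields 5 code points (6 bytes in UTF-8) and NFC yields 4 code points (5 bytes).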

Unicode also includes many characters that really do not belong, but were used in legacy character sets. Unicode added these so that text in those character sets can be processed as Unicode and then converted back without loss.

Compatibility normalization converts these to the corresponding sequence of "real" characters, and also performs canonical normalization. The results of compatibility normalization may not appear identical to the originals.

Characters that include formatting information are replaced with ones that do not. For example, the character ⁹ gets converted to 9. Others don't involve formatting differences. For example, the roman numeral character Ⅸ is converted to the regular letters IX.
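The two conversions can be checked with a small sketch. It assumes the characters in question are SUPERSCRIPT NINE (U+2079) and ROMAN NUMERAL NINE (U+2168), and uses Python's `unicodedata` rather than ICU. It also shows that canonical normalization alone leaves both characters untouched.

```python
import unicodedata

superscript_nine = "\u2079"  # SUPERSCRIPT NINE (assumed example character)
roman_nine = "\u2168"        # ROMAN NUMERAL NINE (assumed example character)

# Compatibility normalization replaces them with plain characters.
nfkc_super = unicodedata.normalize("NFKC", superscript_nine)
nfkc_roman = unicodedata.normalize("NFKC", roman_nine)

# Canonical normalization does not apply compatibility mappings.
nfc_super = unicodedata.normalize("NFC", superscript_nine)
nfc_roman = unicodedata.normalize("NFC", roman_nine)

print(nfkc_super, nfkc_roman)
```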

Obviously, once this transformation has been performed, it is no longer possible to losslessly convert back to the original character set.

The Unicode Consortium suggests thinking of compatibility normalization like a ToUpperCase transform. It is something that may be useful in some circumstances, but you should not just apply it willy-nilly.

An excellent use case would be a search engine, since you would probably want a search for 9 to match ⁹.

One thing you should probably not do is display the result of applying compatibility normalization to the user.
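One way to combine the last two points is to match on a compatibility-normalized key while returning the original text for display. The sketch below is only an illustration of that idea: `search_key` and `find` are hypothetical helpers, the documents are made up, and Python's `unicodedata` stands in for ICU.

```python
import unicodedata

def search_key(text: str) -> str:
    """Return a compatibility-normalized key used only for matching."""
    return unicodedata.normalize("NFKC", text)

def find(query: str, documents: list[str]) -> list[str]:
    """Return the original documents whose normalized key contains the query."""
    key = search_key(query)
    return [doc for doc in documents if key in search_key(doc)]

documents = ["x\u2079 + 1", "Chapter \u2168", "plain text"]

hits_nine = find("9", documents)    # matches the superscript nine
hits_roman = find("IX", documents)  # matches the roman numeral character

print(hits_nine, hits_roman)
```

The returned strings are the untouched originals, so the user never sees the normalized form.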

Compatibility normalization comes in two forms: NFKD and NFKC. They have the same relationship to each other as NFD and NFC.

Any string in NFKC is inherently also in NFC, and the same holds for NFKD and NFD. Thus NFKD(x) = NFD(NFKC(x)), NFKC(x) = NFC(NFKD(x)), and so on.
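These relationships can be spot-checked on a few sample strings. This is a sanity check under assumed inputs, not a proof; `check_identities` is a hypothetical helper and Python's `unicodedata` is used in place of ICU.

```python
from unicodedata import normalize

def check_identities(s: str) -> bool:
    """Check the NFKC/NFC and NFKD/NFD relationships for one string."""
    nfkc = normalize("NFKC", s)
    nfkd = normalize("NFKD", s)
    return (
        normalize("NFC", nfkc) == nfkc      # NFKC output is already in NFC
        and normalize("NFD", nfkd) == nfkd  # NFKD output is already in NFD
        and nfkd == normalize("NFD", nfkc)  # NFKD(x) = NFD(NFKC(x))
        and nfkc == normalize("NFC", nfkd)  # NFKC(x) = NFC(NFKD(x))
    )

# Sample inputs: precomposed é, decomposed é, superscript nine,
# roman numeral nine, and the "fi" ligature.
samples = ["caf\u00e9", "e\u0301", "\u2079", "\u2168", "\ufb01"]
all_hold = all(check_identities(s) for s in samples)

print(all_hold)
```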

If in doubt, go with canonical normalization. Choose NFC or NFD based on the applicable space/speed trade-off, or based on what is required by whatever you are interoperating with.
