Why doesn't ICU4J match UTF-8 sort order?


Question

I am having a hard time understanding Unicode sort order.

When I run Collator.getInstance(Locale.ENGLISH).compare("_", "#") under ICU4J 55.1, I get a return value of -1, indicating that _ comes before #.

However, looking at http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec I see that # (U+0023) comes before _ (U+005F). Why is ICU4J returning a value of -1?

Recommended answer

First, UTF-8 is just an encoding. It specifies how Unicode code points are physically stored as bytes, but it does not define sorting, comparison, or any other collation behavior.

Now, the page you linked to lists everything in numerical code point order. That is the order in which things would sort under a binary collation (in SQL Server, those are the collations with names ending in _BIN or _BIN2). Non-binary ordering is far more complex. Its rules are described in the Unicode Collation Algorithm (UCA).
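To make the "binary collation" idea concrete, here is a minimal sketch in plain Java that compares strings by code point only, with no collation rules involved. The class name CodePointOrder and its helper methods are hypothetical names chosen for this illustration; they are not part of ICU4J or the JDK.

```java
import java.nio.charset.StandardCharsets;

public class CodePointOrder {
    // Compares two strings code point by code point, the way a binary
    // collation would. Hypothetical helper, not an ICU4J or JDK API.
    public static int binaryCompare(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other (or they are equal).
        return Integer.compare(a.length() - i, b.length() - j);
    }

    // Returns the UTF-8 bytes of s as upper-case hex, e.g. "5F" for "_".
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("#"));            // prints 23
        System.out.println(utf8Hex("_"));            // prints 5F
        System.out.println(binaryCompare("_", "#")); // prints 1: _ sorts after # by code point
    }
}
```

The result here is positive, the opposite of what the ICU4J Collator returns in the question, because this comparison looks only at the numeric values U+005F and U+0023.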

The base rules are found here: http://www.unicode.org/repos/cldr/tags/release-28/common/uca/allkeys_CLDR.txt

That file shows:

005F  ; [*010A.0020.0002] # LOW LINE
...
0023  ; [*0290.0020.0002] # NUMBER SIGN
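As a rough illustration of how to read those two entries, the sketch below extracts the first (primary) weight from each line and compares them. It assumes the single-element line format shown above; the class AllKeysPrimary and its method are hypothetical names for this example. Real collation builds full multi-level sort keys, so this is a simplification rather than what ICU4J actually does internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllKeysPrimary {
    // Matches entries such as "005F  ; [*010A.0020.0002] # LOW LINE".
    // Assumption: exactly one collation element with three 4-digit hex weights.
    private static final Pattern ENTRY = Pattern.compile(
        "^([0-9A-F]+)\\s*;\\s*\\[[.*]([0-9A-F]{4})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\]");

    // Returns the primary weight (the first of the three weights) of an entry.
    public static int primaryWeight(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Unrecognized entry: " + line);
        }
        return Integer.parseInt(m.group(2), 16);
    }

    public static void main(String[] args) {
        String lowLine = "005F  ; [*010A.0020.0002] # LOW LINE";
        String numberSign = "0023  ; [*0290.0020.0002] # NUMBER SIGN";

        int a = primaryWeight(lowLine);    // 0x010A
        int b = primaryWeight(numberSign); // 0x0290

        // A smaller primary weight sorts first, so LOW LINE precedes NUMBER SIGN.
        System.out.println(Integer.compare(a, b)); // prints -1
    }
}
```

The primary weight 010A for LOW LINE is smaller than 0290 for NUMBER SIGN, which is consistent with the -1 seen in the question.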

It is very important to keep in mind that any locale / culture can override these base rules. So while the lines quoted above explain this specific case, for other cases you would also need to check http://www.unicode.org/repos/cldr/tags/release-28/common/collation/ to see whether there are any locale-specific overrides.
