Why doesn't ICU4J match UTF-8 sort order?
Question
I am having a hard time understanding Unicode sorting order.
When I run Collator.getInstance(Locale.ENGLISH).compare("_", "#") under ICU4J 55.1, I get a return value of -1, indicating that _ comes before #.
However, looking at http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec I see that # (U+0023) comes before _ (U+005F). Why is ICU4J returning a value of -1?
Recommended answer
First, UTF-8 is just an encoding. It specifies how to store the Unicode code points physically, but does not handle sorting, comparisons, etc.
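To make the "just an encoding" point concrete, here is a minimal sketch using only the JDK. The class name Utf8EncodingDemo and the helper utf8Hex are made up for this illustration; the example only shows how code points map to stored bytes and says nothing about sort order.

```java
import java.nio.charset.StandardCharsets;

public class Utf8EncodingDemo {
    // Hypothetical helper: returns the UTF-8 bytes of a string as space-separated hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // '#' and '_' are one byte each; U+00E9 needs two bytes in UTF-8.
        for (String s : new String[] {"#", "_", "\u00E9"}) {
            System.out.printf("U+%04X -> %s%n", s.codePointAt(0), utf8Hex(s));
        }
    }
}
```

The byte sequences describe storage only. Which string sorts first is decided by whatever comparison is applied afterwards.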
Now, the page you linked to shows everything in numerical code point order. That is the order things would sort in if using a binary collation (in SQL Server, that would be collations with names ending in _BIN and _BIN2). But non-binary ordering is far more complex. The rules are described here: Unicode Collation Algorithm (UCA).
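As a rough illustration of what code point order means in Java terms, the sketch below sorts a few strings by code point value using only the JDK. The class name CodePointOrderDemo and the method compareByCodePoint are invented for this example and are not part of ICU4J. It stands in for a binary comparison, not for the UCA.

```java
import java.util.Arrays;

public class CodePointOrderDemo {
    // Hypothetical comparator: orders strings by code point value,
    // which is the order shown on the linked chart.
    static int compareByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // The string with code points left over sorts last.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String[] items = {"_", "#", "a", "A"};
        Arrays.sort(items, CodePointOrderDemo::compareByCodePoint);
        System.out.println(Arrays.toString(items)); // prints [#, A, _, a]
    }
}
```

Here # (U+0023) lands before _ (U+005F), matching the chart. A collator that follows the UCA consults collation weights instead, which is why ICU4J can disagree with this order.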
The base rules are found here: http://www.unicode.org/repos/cldr/tags/release-28/common/uca/allkeys_CLDR.txt
It shows:
005F ; [*010A.0020.0002] # LOW LINE
...
0023 ; [*0290.0020.0002] # NUMBER SIGN
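The first hexadecimal field inside the brackets is the primary weight, and the lower primary weight sorts first. The sketch below pulls that field out of the two lines quoted above and compares them. It is a simplification: the class name AllKeysPrimaryDemo and the method primaryWeight are made up for this example, it reads only the first collation element of a line, and it assumes the weights are compared as-is (no special handling of the variable elements that the * marks).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllKeysPrimaryDemo {
    // Matches the first collation element of an allkeys-style line,
    // e.g. "[*010A.0020.0002]" or "[.1234.0020.0002]".
    private static final Pattern FIRST_ELEMENT = Pattern.compile(
            "\\[[.*]([0-9A-Fa-f]+)\\.([0-9A-Fa-f]+)\\.([0-9A-Fa-f]+)\\]");

    // Hypothetical helper: returns the primary weight of the first element.
    static int primaryWeight(String line) {
        Matcher m = FIRST_ELEMENT.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("No collation element in: " + line);
        }
        return Integer.parseInt(m.group(1), 16);
    }

    public static void main(String[] args) {
        String lowLine = "005F ; [*010A.0020.0002] # LOW LINE";
        String numberSign = "0023 ; [*0290.0020.0002] # NUMBER SIGN";
        int cmp = Integer.compare(primaryWeight(lowLine), primaryWeight(numberSign));
        // 0x010A is less than 0x0290, so LOW LINE sorts before NUMBER SIGN.
        System.out.println(cmp); // prints -1
    }
}
```

The negative result lines up with the -1 from the question: _ has the smaller primary weight, so it sorts before # even though its code point is larger.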
It is very important to keep in mind that any locale / culture can override these base rules. Hence, while the few lines noted above explain this specific circumstance, other circumstances would need to check http://www.unicode.org/repos/cldr/tags/release-28/common/collation/ to see if there are any locale-specific overrides.