How do I sort Unicode strings alphabetically in Python?
Question
By default, Python sorts strings by byte value (by code point value for Unicode strings), which means é comes after z, along with other equally funny results. What is the best way to sort alphabetically in Python?
Is there a library for this? I couldn't find anything. Preferably, sorting should have language support, so that it understands that åäö should be sorted after z in Swedish, but that ü should be sorted with u, and so on. Unicode support is therefore pretty much a requirement.
If there is no library for it, what is the best way to do this? Just make a mapping from each letter to an integer value and map the string to a list of integers with that?
Recommended answer
IBM's ICU library does that (and a lot more). It has Python bindings: PyICU.
Update: The core difference in sorting between ICU and locale.strcoll is that ICU uses the full Unicode Collation Algorithm, while strcoll uses ISO 14651.
The differences between those two algorithms are briefly summarized here: http://unicode.org/faq/collation.html#13. These are rather exotic special cases, which should rarely matter in practice.
>>> import icu # pip install PyICU
>>> sorted(['a','b','c','ä'])
['a', 'b', 'c', 'ä']
>>> collator = icu.Collator.createInstance(icu.Locale('de_DE.UTF-8'))
>>> sorted(['a','b','c','ä'], key=collator.getSortKey)
['a', 'ä', 'b', 'c']
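For comparison, here is a minimal sketch of the locale.strcoll route mentioned in the update, using only the standard library. The locale names tried below are assumptions: which locales are installed, and how they are spelled, depends on the operating system, so the sketch falls back to the current locale if none can be set. Note that locale.setlocale changes process-wide state.

```python
import functools
import locale

words = ['a', 'b', 'c', 'ä']


def set_collation_locale(candidates=('de_DE.UTF-8', 'de_DE.utf8', 'de_DE')):
    """Try each candidate locale name for LC_COLLATE.

    Returns the name that was set, or None if none is available.
    The candidate names are assumptions; adjust them for your system.
    """
    for name in candidates:
        try:
            locale.setlocale(locale.LC_COLLATE, name)
            return name
        except locale.Error:
            continue
    return None


active = set_collation_locale()

# strxfrm transforms each string into a key whose ordinary comparison
# follows the collation rules of the active LC_COLLATE locale.
by_key = sorted(words, key=locale.strxfrm)

# The same idea, using the comparison function strcoll directly.
by_cmp = sorted(words, key=functools.cmp_to_key(locale.strcoll))

print('locale in use:', active)
print(by_key)
print(by_cmp)
```

If a German locale could be set, ä would normally be expected to sort next to a, but the exact result depends on the platform's locale data, which is one practical reason to prefer ICU when consistent behaviour across systems matters.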