How do I sort unicode strings alphabetically in Python?

Question

Python sorts by byte value by default, which means é comes after z and other equally funny things. What is the best way to sort alphabetically in Python?

Is there a library for this? I couldn't find anything. Preferably, sorting should have language support, so that it understands that åäö should be sorted after z in Swedish, but that ü should be sorted with u, etc. Unicode support is therefore pretty much a requirement.

If there is no library for it, what is the best way to do this? Just make a mapping from each letter to an integer value and map the string to a list of integers with that?

Recommended answer

IBM's ICU library does that (and a lot more). It has Python bindings: PyICU.

Update: The core difference in sorting between ICU and locale.strcoll is that ICU uses the full Unicode Collation Algorithm while strcoll uses ISO 14651.

The differences between those two algorithms are briefly summarized here: http://unicode.org/faq/collation.html#13. These are rather exotic special cases, which should rarely matter in practice.
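Since locale.strcoll is the standard-library alternative mentioned above, here is a minimal sketch of how it could be used for sorting. The helper name sort_with_locale is hypothetical, and which locale names are installed (for example 'sv_SE.UTF-8') depends entirely on the system, so the sketch returns None when the requested locale is unavailable. Collation quality also depends on the platform's C library.

```python
import locale
from functools import cmp_to_key


def sort_with_locale(words, locale_name):
    """Sort words with the C library's collation for locale_name.

    sort_with_locale is a hypothetical helper, not part of any library.
    Returns None if the locale is not installed on this system.
    """
    # Remember the current collation setting so it can be restored.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strcoll compares two strings according to LC_COLLATE.
        return sorted(words, key=cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


# The locale name here is an assumption; it may not exist on your machine.
swedish = sort_with_locale(["zebra", "äpple", "apa"], "sv_SE.UTF-8")
print(swedish if swedish is not None else "sv_SE.UTF-8 is not installed")
```

Note that setlocale changes process-wide state, so this approach is awkward in multi-threaded programs. The ICU collator shown next avoids that, because each collator object carries its own locale.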

>>> import icu # pip install PyICU
>>> sorted(['a','b','c','ä'])
['a', 'b', 'c', 'ä']
>>> collator = icu.Collator.createInstance(icu.Locale('de_DE.UTF-8'))
>>> sorted(['a','b','c','ä'], key=collator.getSortKey)
['a', 'ä', 'b', 'c']
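
For comparison, the hand-rolled mapping the question asks about might look like the sketch below. It assumes a fixed Swedish alphabet string and a hypothetical swedish_key function, ignores case, and simply pushes any character outside the alphabet (including ü) to the end. It is not a substitute for real collation rules, which is why a library such as ICU is the better choice.

```python
# Assumed alphabet order for illustration: Swedish places å, ä, ö after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word):
    """Map a word to a list of (rank, code point) pairs (hypothetical helper).

    Letters in the alphabet get their position; anything else sorts after
    all known letters, ordered by code point.
    """
    return [
        (RANK[ch], 0) if ch in RANK else (len(RANK), ord(ch))
        for ch in word.lower()
    ]


words = ["zebra", "äpple", "ål", "apa", "öra"]
print(sorted(words, key=swedish_key))
# → ['apa', 'zebra', 'ål', 'äpple', 'öra']
```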
