How do I sort Unicode strings alphabetically in Python?

Problem description

Python sorts by byte value by default, which means é comes after z and other equally funny things. What is the best way to sort alphabetically in Python?

Is there a library for this? I couldn't find anything. Preferably the sorting should have language support, so that it understands that åäö should be sorted after z in Swedish, but that ü should be sorted by u, etc. Unicode support is thereby pretty much a requirement.

If there is no library for it, what is the best way to do this? Just make a mapping from letter to an integer value and map the string to an integer list with that?
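
For illustration, the mapping idea could look roughly like the sketch below. The names `SWEDISH_ALPHABET`, `RANK` and `swedish_key` are made up for this example. It only lower-cases the input and ranks single letters, so it does not fold ü into u or handle accents, case levels or multi-character rules the way a real collation library would.

```python
# Hypothetical hand-rolled collation: rank each letter by its position
# in the Swedish alphabet, where å, ä and ö follow z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word):
    """Map a word to a list of integers usable as a sort key."""
    # Letters in the alphabet get their rank; anything else is pushed
    # after all known letters and ordered by code point (an assumption).
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]


words = ["zebra", "äpple", "apa", "öl", "ål"]
print(sorted(words, key=swedish_key))
# ['apa', 'zebra', 'ål', 'äpple', 'öl']
```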

Solution

IBM's ICU library does that (and a lot more). It has Python bindings: PyICU.

Update: The core difference in sorting between ICU and locale.strcoll is that ICU uses the full Unicode Collation Algorithm while strcoll uses ISO 14651.

The differences between those two algorithms are briefly summarized here: http://unicode.org/faq/collation.html#13. These are rather exotic special cases, which should rarely matter in practice.

>>> import icu # pip install PyICU
>>> sorted(['a','b','c','ä'])
['a', 'b', 'c', 'ä']
>>> collator = icu.Collator.createInstance(icu.Locale('de_DE.UTF-8'))
>>> sorted(['a','b','c','ä'], key=collator.getSortKey)
['a', 'ä', 'b', 'c']
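
For comparison, the locale.strcoll route mentioned in the update can be tried with the standard library alone. This is only a sketch: `sort_with_locale` is a hypothetical helper name, the result depends on which locales the operating system has installed and on how its C library implements collation, and `locale.setlocale` changes process-wide state, so the helper restores the previous setting afterwards.

```python
import locale


def sort_with_locale(words, locale_name):
    """Sort words using the C library's collation rules for locale_name."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        # Raises locale.Error if the locale is not installed.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm turns each string into a key that compares the way
        # strcoll would compare the original strings.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore global state


try:
    print(sort_with_locale(["a", "b", "c", "ä"], "de_DE.UTF-8"))
except locale.Error:
    print("locale de_DE.UTF-8 is not installed on this system")
```

Unlike the ICU collator, this approach is not safe to use from several threads at once with different locales, because the locale setting is shared by the whole process.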
