Is there a way to compare Arabic characters without regard to their initial/medial/final form?


Problem description


In Latin script, letters have an upper case and a lower case form. In Python, if you want to compare two strings without regard to their case, you can convert them to the same case using 'string'.upper() or 'string'.lower()


In Arabic script, letters can have an initial, medial, or final form. Is there a similar way to compare strings of Arabic characters without caring which form the letters are in?

Recommended answer


There are two parts to this, which should work for all languages:*

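  • Your strings need to be normalized to NFKD, so that two equivalent strings end up with the same code points.
  • To ignore case when comparing two NFKD strings, use the Unicode case-folding algorithm.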


Between the two, this handles English upper and lower case, Arabic initial/medial/final (plus isolated), German ß vs. ss, é as a single code point vs. e\N{COMBINING ACUTE ACCENT}, Chinese rotated characters, Japanese half-width kana, and probably all kinds of other things you haven't thought of.
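As a rough sketch of how the two steps combine, the snippet below wraps them in a small helper and tries it on a few of the cases listed above. The names `comparison_key` and `loosely_equal` are made up for illustration and are not part of any library; it assumes Python 3.3+ for `str.casefold`.

```python
import unicodedata


def comparison_key(text: str) -> str:
    """Hypothetical helper: NFKD-normalize the text, then case-fold it."""
    return unicodedata.normalize("NFKD", text).casefold()


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings by their normalized, case-folded keys."""
    return comparison_key(a) == comparison_key(b)


# English upper case vs. lower case
print(loosely_equal("Hello", "HELLO"))
# German sharp s vs. "ss"
print(loosely_equal("Stra\u00dfe", "STRASSE"))
# Precomposed e-acute vs. e followed by COMBINING ACUTE ACCENT
print(loosely_equal("\u00e9", "e\N{COMBINING ACUTE ACCENT}"))
# Half-width vs. full-width katakana KA
print(loosely_equal("\uff76", "\u30ab"))
# Arabic noon, initial form vs. medial form
print(loosely_equal("\ufee7", "\ufee8"))
```

Each of these comparisons should print `True`. Normalizing once and case-folding once is enough for these examples; the full Unicode definition of a compatibility caseless match applies the steps more than once to cover some rarer edge cases.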


In Python, that looks like this:

>>> s1 = 'ﻧ'
>>> s2 = 'ﻨ'
>>> import unicodedata
>>> unicodedata.normalize('NFKD', s1).casefold() == unicodedata.normalize('NFKD', s2).casefold()
True


Note that casefold wasn't added until Python 3.3. If you're using an earlier version of Python, there are implementations on PyPI; using them should be similar to using the 3.3+ builtin.


If you're interested in exactly how this works for Arabic, rather than just the fact that it works for Arabic along with every other language, you'll have to read the algorithms and tables at unicode.org. IIRC, the W3C document that recommends doing this explains why it works, using Arabic as an example. I believe it's because Unicode treats initial, medial, final, and isolated as compatibility-equivalent presentation forms of the same character, so normalizing to decomposed gives you effectively the isolated form plus a modifier that casefolding can skip or transform, even though casefolding directly on a combined character just returns the character itself.
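One way to look at this yourself is to ask Python's `unicodedata` module what the Unicode Character Database records for each presentation form. The sketch below uses the four presentation forms of the letter noon as an assumed example; `unicodedata.decomposition` returns the raw decomposition mapping, including its compatibility tag.

```python
import unicodedata

# Base letter ARABIC LETTER NOON and its four presentation forms
NOON = "\u0646"
PRESENTATION_FORMS = ["\ufee5", "\ufee6", "\ufee7", "\ufee8"]

for ch in PRESENTATION_FORMS:
    name = unicodedata.name(ch)
    # Raw mapping from the character database, e.g. a tag plus a code point
    mapping = unicodedata.decomposition(ch)
    # True when NFKD reduces the presentation form to the base letter
    reduces_to_base = unicodedata.normalize("NFKD", ch) == NOON
    # Case folding alone leaves the presentation form unchanged
    unchanged_by_casefold = ch.casefold() == ch
    print(name, "|", mapping, "|", reduces_to_base, "|", unchanged_by_casefold)
```

In the mappings, the form (initial, medial, and so on) appears as a tag in angle brackets in front of the base code point rather than as a separate character, so after NFKD all four forms become the same single code point.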

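* In some cases, two different languages or cultures use the same script but have different case-folding rules; in that case you need locale-specific case folding, which Python doesn't provide. But that isn't relevant here.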
