Is there a way to compare Arabic characters without regard to their initial/medial/final form?

Problem description

In Latin script, letters have an upper case and a lower case form. In Python, if you want to compare two strings without regard to their case, you can convert them to the same case using 'string'.upper() or 'string'.lower().

In Arabic script, letters can have an initial, medial, or final form. Is there a similar way to compare strings of Arabic characters without caring which form the letters are in?

Solution

There are two parts to this, which should work for all languages:*

  • Your strings must be normalized to NFKD, so that two equivalent strings are guaranteed to have the same code points.
  • To ignore case in comparing two NFKD strings, use the Unicode case-folding algorithm.

Between the two, this handles English upper and lower case, Arabic initial/medial/final (plus isolated), German ß vs. ss, é as a single code point vs. e\N{COMBINING ACUTE ACCENT}, Chinese rotated characters, Japanese half-width kana, and probably all kinds of other things you haven't thought of.

In Python, that looks like this:

>>> import unicodedata
>>> s1 = 'ﻧ'
>>> s2 = 'ﻨ'
>>> unicodedata.normalize('NFKD', s1).casefold() == unicodedata.normalize('NFKD', s2).casefold()
True
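
The same comparison can be wrapped in a small helper and tried on the other cases listed above. This is only a sketch: the names comparison_key and loosely_equal are made up for illustration, and the second normalization pass is an assumption on my part (case folding can produce text that is no longer normalized), not something the steps above spell out.

```python
import unicodedata


def comparison_key(text: str) -> str:
    """Hypothetical helper: key for form- and case-insensitive comparison."""
    # Compatibility-decompose first, so presentation forms, half-width
    # characters, and precomposed letters collapse to their base sequences.
    decomposed = unicodedata.normalize("NFKD", text)
    # Case-fold, then normalize again, because folding can yield
    # sequences that are no longer in NFKD form.
    return unicodedata.normalize("NFKD", decomposed.casefold())


def loosely_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their comparison keys."""
    return comparison_key(a) == comparison_key(b)


pairs = [
    ("\ufee7", "\ufee8"),    # Arabic noon: initial form vs. medial form
    ("Stra\u00dfe", "STRASSE"),  # German ß vs. SS
    ("\u00e9", "e\u0301"),   # é as one code point vs. e + combining acute
    ("\uff76", "\u30ab"),    # half-width vs. full-width katakana KA
]
for left, right in pairs:
    print(ascii(left), ascii(right), loosely_equal(left, right))
```

Different letters still compare unequal; only the forms of the same letter are merged.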

Note that casefold wasn't added until Python 3.3. If you're using an earlier version of Python, there are implementations on PyPI; using them should be similar to using the 3.3+ builtin.


If you're interested in exactly how this works for Arabic, rather than just the fact that it works for Arabic along with every other language, you'll have to read the algorithms and tables at unicode.org. IIRC, the W3C document that recommends doing this explains why it works using Arabic as an example. I believe it's because Unicode treats initial, medial, final, and isolated as compatibility-equivalent presentation forms of the same character, so compatibility decomposition (the K in NFKD) maps each of them to the same base letter. Case folding applied directly to a presentation form just returns the character itself, so for Arabic it is the normalization step that does the work.
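
One way to check this yourself, as a rough sketch: the standard unicodedata module exposes the compatibility decomposition of each code point. The choice of the letter noon and the variable names here are just for illustration.

```python
import unicodedata

NOON = "\u0646"  # ARABIC LETTER NOON, the base letter
# Isolated, final, initial, and medial presentation forms of noon.
forms = ["\ufee5", "\ufee6", "\ufee7", "\ufee8"]

for ch in forms:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        # Compatibility mapping from the Unicode database, with its tag.
        repr(unicodedata.decomposition(ch)),
        # Does NFKD reduce this form to the base letter?
        unicodedata.normalize("NFKD", ch) == NOON,
        # Does case folding alone change the character?
        ch.casefold() == ch,
    )
```

The decomposition string carries a tag such as <initial> or <medial>, but that tag is not part of the normalized output; only the base letter remains.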


* There are a few cases where two different languages or cultures use the same script, but have different case-folding rules; in that case, you need locale-specific casefolding, which Python doesn't include. But that shouldn't be relevant here.
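
The usual example of this is Turkish, where I pairs with dotless ı and İ pairs with i. The sketch below shows the mismatch with the built-in casefold and one possible workaround; turkic_casefold is a hypothetical helper, not part of Python, and it only covers the two special Turkic mappings I'm aware of.

```python
def turkic_casefold(text: str) -> str:
    """Hypothetical helper: case folding with the Turkic dotted/dotless i rules."""
    # Apply the two Turkic-specific mappings first:
    # U+0130 (İ) folds to i, and I folds to U+0131 (ı).
    text = text.replace("\u0130", "i").replace("I", "\u0131")
    # Everything else uses the default folding.
    return text.casefold()


upper = "ISPARTA"
lower = "\u0131sparta"  # ısparta, with a dotless ı

# Default folding maps I to i, so these two do not match.
print(upper.casefold() == lower.casefold())
# With the Turkic rules applied first, they do.
print(turkic_casefold(upper) == turkic_casefold(lower))
```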
