Folding/Normalizing Ligatures (e.g. Æ to ae) Using (Core)Foundation

Question

I am writing a helper that performs a number of transformations on an input string, in order to create a search-friendly representation of that string.

Consider the following scenario:

  • Full text search on German or French texts
  • The entries in your datastore contain
  1. Müller
  2. Großmann
  3. Çingletòn
  4. Bjørk
  5. Æreogramme

  • The search should be fuzzy, in that

    1. ull, Üll etc. match Müller
    2. Gros, groß etc. match Großmann
    3. cin etc. match Çingletòn
    4. bjö, bjo etc. match Bjørk
    5. aereo etc. match Æreogramme

    So far, I've been successful in cases (1), (3) and (4).

    What I cannot figure out, is how to handle (2) and (5).

    So far, I've tried the following, to no avail:

    CFStringNormalize() // with all documented normalization forms
    CFStringTransform() // using the kCFStringTransformToLatin, kCFStringTransformStripCombiningMarks, kCFStringTransformStripDiacritics
    CFStringFold() // using kCFCompareNonliteral, kCFCompareWidthInsensitive, kCFCompareLocalized in a number of combinations -- aside: how on earth do I normalize simply _composing_ already decomposed strings??? as soon as I pack that in, my formerly passing tests fail, as well...
    

    I've skimmed over the ICU User Guide for Transforms but didn't invest too heavily in it…for what I think are obvious reasons.

    I know that I could catch case (2) by transforming to uppercase and then back to lowercase, which would work within the realms of this particular application. I am, however, interested in solving this problem on a more fundamental level, hopefully allowing for case-sensitive applications as well.

    Any hints would be greatly appreciated!

Recommended answer

    Congratulations, you've found one of the more painful bits of text processing!

    First off, NamesList.txt and CaseFolding.txt are indispensable resources for things like this, if you haven't already seen them.

    Part of the problem is you're trying to do something almost correct that works in all the languages/locales you care about, whereas Unicode is more concerned about doing the correct thing when displaying strings in a single language-locale.

    For (2), ß has canonically case-folded to ss since the earliest CaseFolding.txt I can find (3.0-Update1/CaseFolding-2.txt). CFStringFold() and -[NSString stringByFoldingWithOptions:] ought to do the right thing, but if not, a "locale-independent" s.upper().lower() appears to give a sensible answer for all inputs (and also handles the infamous "Turkish I").
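As a rough illustration of the case folding described above, here is a minimal sketch in Python (chosen only because the answer already uses the `s.upper().lower()` notation; it is not CoreFoundation code). It assumes Python 3, where `str.casefold()` applies Unicode full case folding. The helper names `fold_case` and `upper_lower` are made up for this example.

```python
def fold_case(s: str) -> str:
    """Locale-independent Unicode case folding (hypothetical helper name)."""
    return s.casefold()


def upper_lower(s: str) -> str:
    """The uppercase-then-lowercase round trip mentioned in the answer."""
    return s.upper().lower()


if __name__ == "__main__":
    # Both approaches expand the sharp s to "ss".
    for word in ("Großmann", "GROSSMANN", "groß"):
        print(word, "->", fold_case(word), "/", upper_lower(word))
```

With either helper, a folded query such as "gros" is a substring of the folded entry "grossmann", which is what case (2) asks for.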

    For (5), you're a little out of luck: Unicode 6.2 doesn't appear to contain a normative mapping from Æ to AE, and the character's name has changed from "letter" to "ligature" and back again (U+00C6 is LATIN CAPITAL LETTER A E in 1.0, LATIN CAPITAL LIGATURE AE in 1.1, and LATIN CAPITAL LETTER AE in 2.0). You could search NamesList.txt for "ligature" and add a bunch of special cases.
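The special-case approach could look something like the following Python sketch. The table contents are an assumption chosen for illustration (they are not taken from any Unicode data file), and `expand_special_cases` and `has_decomposition` are hypothetical helper names. The second helper uses the standard `unicodedata` module to show why normalization alone does not help here: Æ has no decomposition mapping at all.

```python
import unicodedata

# Hypothetical special-case table; extend it after reviewing NamesList.txt.
SPECIAL_CASES = str.maketrans({
    "\u00c6": "AE",  # Æ
    "\u00e6": "ae",  # æ
    "\u0152": "OE",  # Œ
    "\u0153": "oe",  # œ
})


def expand_special_cases(s: str) -> str:
    """Replace letters that Unicode normalization leaves untouched."""
    return s.translate(SPECIAL_CASES)


def has_decomposition(ch: str) -> bool:
    """True if the character has a canonical or compatibility decomposition."""
    return unicodedata.decomposition(ch) != ""


if __name__ == "__main__":
    print(expand_special_cases("Æreogramme"))
    print(has_decomposition("\u00c6"))  # Æ
    print(has_decomposition("\ufb01"))  # the "fi" presentation-form ligature
```

Presentation-form ligatures such as U+FB01 do carry a compatibility decomposition, so NFKC/NFKD already handles them; the table is only needed for characters like Æ that Unicode treats as letters in their own right.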

    Notes:

    • CFStringNormalize() doesn't do what you want. You do want to normalize strings before adding them to the index; I suggest NFKC at the start and end of other processing.
    • CFStringTransform() doesn't quite do what you want either; all the scripts involved are already "latin", so transliterating to Latin has nothing to change.
    • CFStringFold() is order-dependent: The combining ypogegrammeni and prosgegrammeni are stripped by kCFCompareDiacriticInsensitive but converted to a lowercase iota by kCFCompareCaseInsensitive. The "correct" thing appears to be to do the case-fold first followed by the others, although stripping it may make more sense linguistically.
    • You almost certainly do not want to use kCFCompareLocalized unless you want to rebuild the search index every time the locale changes.
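Putting the notes together, one possible ordering is: NFKC, case-fold, special cases, strip combining marks, NFKC again. The Python sketch below follows that order. It is not a CoreFoundation implementation; `search_key`, `strip_marks`, `matches` and the contents of `SPECIAL_CASES` are all assumptions made for illustration, and "diacritic" is approximated here as "any character in Unicode category Mn".

```python
import unicodedata

# Hypothetical special cases for letters that have no Unicode decomposition.
# Applied after case folding, so only lowercase forms are needed.
SPECIAL_CASES = str.maketrans({"\u00e6": "ae", "\u0153": "oe", "\u00f8": "o"})


def strip_marks(s: str) -> str:
    """Decompose, drop nonspacing marks (category Mn), then recompose."""
    decomposed = unicodedata.normalize("NFD", s)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def search_key(s: str) -> str:
    """Build a search-friendly key: normalize, fold case, then strip marks."""
    s = unicodedata.normalize("NFKC", s)
    s = s.casefold()                # case-fold first (order matters)
    s = s.translate(SPECIAL_CASES)  # æ, œ, ø have no decomposition
    s = strip_marks(s)
    return unicodedata.normalize("NFKC", s)


def matches(query: str, entry: str) -> bool:
    """Substring match on folded keys."""
    return search_key(query) in search_key(entry)


if __name__ == "__main__":
    for name in ("Müller", "Großmann", "Çingletòn", "Bjørk", "Æreogramme"):
        print(name, "->", search_key(name))
```

The order dependence is visible with U+0345 (combining ypogegrammeni): `strip_marks` alone removes it, whereas case folding first turns it into a lowercase iota that then survives the stripping step. Since no locale is consulted anywhere, the resulting keys do not need rebuilding when the user's locale changes.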

    Note for readers using other languages: check that the function you use does not depend on the user's current locale! Java users should use something like s.toUpperCase(Locale.ENGLISH), .NET users should use s.ToUpperInvariant(). If you actually want the user's current locale, specify it explicitly.
