Enumerating a string by grapheme instead of character


Question

Strings are usually enumerated by character. But, particularly when working with Unicode and non-English languages, I sometimes need to enumerate a string by grapheme. That is, combining marks and diacritics should be kept with the base character they modify. What is the best way to do this in .NET?

Use case: Count the distinct phonetic sounds in a series of IPA words.

  1. Simplified definition: There is a one-to-one relationship between a grapheme and a sound.
  2. Realistic definition: Special "letter-like" characters should also be included with the base character (ex. pʰ), and some sounds may be represented by two symbols joined by a tie bar (k͡p).
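
To make the problem concrete, here is a minimal sketch contrasting the two ways of counting. The class and method names are made up for illustration; `string.Length` and `StringInfo.LengthInTextElements` are the standard .NET members.

```csharp
using System.Globalization;

public static class GraphemeLengthDemo
{
    // Number of UTF-16 code units: what string.Length and a foreach over the string see.
    public static int CountChars(string text)
    {
        return text.Length;
    }

    // Number of text elements: a base character together with its combining marks.
    public static int CountTextElements(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }
}
```

For the nasalized vowel "a\u0303" (a followed by a combining tilde), `CountChars` returns 2 while `CountTextElements` returns 1, so enumerating by character would split one sound into two entries.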

Solution

Simplified scenario

The TextElementEnumerator (http://msdn.microsoft.com/en-us/library/system.globalization.textelementenumerator.aspx) is very useful and efficient:

private static List<SoundCount> CountSounds(IEnumerable<string> words)
{
    Dictionary<string, SoundCount> soundCounts = new Dictionary<string, SoundCount>();

    foreach (var word in words)
    {
        TextElementEnumerator graphemeEnumerator = StringInfo.GetTextElementEnumerator(word);
        while (graphemeEnumerator.MoveNext())
        {
            string grapheme = graphemeEnumerator.GetTextElement();

            SoundCount count;
            if (!soundCounts.TryGetValue(grapheme, out count))
            {
                count = new SoundCount() { Sound = grapheme };
                soundCounts.Add(grapheme, count);
            }
            count.Count++;
        }
    }

    return new List<SoundCount>(soundCounts.Values);
}
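
The method above relies on a `SoundCount` type that is not shown. A minimal sketch of what it might look like, assuming only the two members the code touches (`Sound` and `Count`):

```csharp
// Hypothetical shape of SoundCount, inferred from how the counting code uses it.
// It has to be a class (a reference type): the code increments count.Count after
// the object is already stored in the dictionary, which would not update the
// stored value if SoundCount were a struct.
public class SoundCount
{
    public string Sound { get; set; }
    public int Count { get; set; }
}
```

To compile, the counting methods also need `using System.Collections.Generic;` and `using System.Globalization;`, plus `using System.Text.RegularExpressions;` for the regular expression variant.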

You can also do this using a regular expression. (According to the documentation, the TextElementEnumerator handles a few cases that the expression below does not, particularly supplementary characters, but those are pretty rare and in any case not needed for my application.)

private static List<SoundCount> CountSoundsRegex(IEnumerable<string> words)
{
    var soundCounts = new Dictionary<string, SoundCount>();
    var graphemeExpression = new Regex(@"\P{M}\p{M}*");

    foreach (var word in words)
    {
        Match graphemeMatch = graphemeExpression.Match(word);
        while (graphemeMatch.Success)
        {
            string grapheme = graphemeMatch.Value;

            SoundCount count;
            if (!soundCounts.TryGetValue(grapheme, out count))
            {
                count = new SoundCount() { Sound = grapheme };
                soundCounts.Add(grapheme, count);
            }
            count.Count++;

            graphemeMatch = graphemeMatch.NextMatch();
        }
    }

    return new List<SoundCount>(soundCounts.Values);
}
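
The caveat about supplementary characters can be seen with a small comparison. This is only a sketch with invented helper names; it counts elements with the same `\P{M}\p{M}*` expression and with `StringInfo`.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class SupplementaryDemo
{
    // Counts matches of the simplified grapheme expression.
    // .NET regexes work on UTF-16 code units, so each half of a
    // surrogate pair is matched as its own non-mark character.
    public static int CountWithRegex(string text)
    {
        return Regex.Matches(text, @"\P{M}\p{M}*").Count;
    }

    // Counts text elements; a surrogate pair is one element.
    public static int CountWithStringInfo(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }
}
```

For "\U0001D11E" (a character outside the Basic Multilingual Plane), `CountWithRegex` returns 2 and `CountWithStringInfo` returns 1. For text made only of BMP characters, such as "a\u0303", both return 1.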

Performance: In my testing, I found that the TextElementEnumerator was about 4 times as fast as the regular expression.

Realistic scenario

Unfortunately, there is no way to "tweak" how the TextElementEnumerator enumerates, so that class will be of no use in the realistic scenario.

One solution is to tweak our regular expression:

[^\p{M}\p{Lm}]     # Match a character that is NOT a combining mark (a character intended to be combined with another character) and NOT a modifier letter (a special character that is used like a letter)
(?:                # Start a group for the combining characters:
  (?:                # Start a group for tied characters:
    [\u035C\u0361]      # Match an under- or over- tie bar...
    \P{M}\p{M}*         # ...followed by another grapheme (in the simplified sense)
  )                  # (End the tied characters group)
  |\p{M}             # OR a character intended to be combined with another character
  |\p{Lm}            # OR a special character that is used like a letter
)*                 # Match the combining characters group zero or more times.
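
Here is a sketch of how the commented expression might be used from C#. The class and method names are invented for illustration. Two points to note: the `#` comments and layout whitespace only work with `RegexOptions.IgnorePatternWhitespace`, and the first class is written here in negated form, `[^\p{M}\p{Lm}]`, which is what its comment describes (a class that merely lists `\P{M}` and `\P{Lm}` side by side would accept any character).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RealisticGraphemes
{
    private static readonly Regex GraphemeExpression = new Regex(@"
        [^\p{M}\p{Lm}]        # base: neither a combining mark nor a modifier letter
        (?:
          [\u035C\u0361]      # an under- or over- tie bar...
          \P{M}\p{M}*         # ...followed by another simplified grapheme
          |\p{M}              # or a combining mark
          |\p{Lm}             # or a modifier letter
        )*",
        RegexOptions.IgnorePatternWhitespace);

    // Splits a word into graphemes in the realistic sense.
    public static List<string> Split(string word)
    {
        var graphemes = new List<string>();
        foreach (Match match in GraphemeExpression.Matches(word))
        {
            graphemes.Add(match.Value);
        }
        return graphemes;
    }
}
```

With this, `Split("p\u02B0a")` yields two elements, "pʰ" and "a", and `Split("k\u0361pa")` yields "k͡p" and "a". A mark or modifier letter with no base in front of it is skipped rather than returned.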

We could probably also create our own IEnumerator<string> using CharUnicodeInfo.GetUnicodeCategory to regain our performance, but that seems like too much work to me and extra code to maintain. (Anyone else want to have a go?) Regexes are made for this.
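
For anyone who does want to have a go, here is one possible sketch of a hand-written enumerator. It is not benchmarked, the names are invented, and it makes two simplifying assumptions: it works on UTF-16 code units (no surrogate pair handling), and a mark or modifier letter with no preceding base starts its own element instead of being skipped.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class IpaGraphemes
{
    // Combining marks: Unicode categories Mn, Mc and Me (\p{M} in a regex).
    private static bool IsMark(char c)
    {
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
        return category == UnicodeCategory.NonSpacingMark
            || category == UnicodeCategory.SpacingCombiningMark
            || category == UnicodeCategory.EnclosingMark;
    }

    // Modifier letters such as U+02B0 (\p{Lm} in a regex).
    private static bool IsModifierLetter(char c)
    {
        return CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.ModifierLetter;
    }

    // The under- and over- tie bars.
    private static bool IsTieBar(char c)
    {
        return c == '\u035C' || c == '\u0361';
    }

    public static IEnumerable<string> Enumerate(string text)
    {
        int i = 0;
        while (i < text.Length)
        {
            int start = i;
            i++; // consume the base character
            while (i < text.Length)
            {
                char c = text[i];
                if (IsTieBar(c) && i + 1 < text.Length && !IsMark(text[i + 1]))
                {
                    i += 2; // tie bar plus the character it joins
                }
                else if (IsMark(c) || IsModifierLetter(c))
                {
                    i++; // attach the mark or modifier letter to the current element
                }
                else
                {
                    break; // next base character starts a new element
                }
            }
            yield return text.Substring(start, i - start);
        }
    }
}
```

For example, enumerating "k\u0361p\u02B0a" yields "k͡pʰ" followed by "a". Whether this actually recovers the speed lost to the regular expression would need measuring.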
