How can I perform a culture-sensitive "starts-with" operation from the middle of a string?

Question

I have a requirement which is relatively obscure, but it feels like it should be possible using the BCL.

For context, I'm parsing a date/time string in Noda Time. I maintain a logical cursor for my position within the input string. So while the complete string may be "3 January 2013" the logical cursor may be at the 'J'.

Now, I need to parse the month name, comparing it against all the known month names for the culture:

  • Culture-sensitively
  • Case-insensitively
  • Just from the point of the cursor (not later; I want to see if the cursor is "looking at" the candidate month name)
  • Quickly
  • ... and I need to know afterwards how many characters were used

The current code to do this generally works, using CompareInfo.Compare. It's effectively like this (just for the matching part - there's more code in the real thing, but it's not relevant to the match):

internal bool MatchCaseInsensitive(string candidate, CompareInfo compareInfo)
{
    return compareInfo.Compare(text, position, candidate.Length,
                               candidate, 0, candidate.Length, 
                               CompareOptions.IgnoreCase) == 0;
}

However, that relies on the candidate and the region we compare being the same length. Fine most of the time, but not fine in some special cases. Suppose we have something like:

// U+00E9 is a single code point for e-acute
var text = "x b\u00e9d y";
int position = 2;
// e followed by U+0301 still means e-acute, but from two code points
var candidate = "be\u0301d";

Now my comparison will fail. I could use IsPrefix:

if (compareInfo.IsPrefix(text.Substring(position), candidate,
                         CompareOptions.IgnoreCase))

but:

  • That requires me to create a substring, which I'd really rather avoid. (I'm viewing Noda Time as effectively a system library; parsing performance may well be important to some clients.)
  • It doesn't tell me how far to advance the cursor afterwards
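For readers on newer runtimes: .NET 5 and later add a span-based `CompareInfo.IsPrefix` overload with an `out int matchLength` parameter, which seems to address both bullets above. That overload did not exist when this question was asked, so the following is only a sketch of how it could be used; `CursorMatcher` and `TryMatch` are hypothetical names, not Noda Time code.

```csharp
using System;
using System.Globalization;

public static class CursorMatcher
{
    // Hypothetical helper: checks whether `candidate` matches at `position`
    // and reports how many chars of `text` the match consumed.
    // Assumes .NET 5+ for the IsPrefix overload that returns matchLength.
    public static bool TryMatch(string text, int position, string candidate,
                                CompareInfo compareInfo, out int matchLength)
    {
        // AsSpan slices the string without allocating a substring.
        ReadOnlySpan<char> remaining = text.AsSpan(position);
        return compareInfo.IsPrefix(remaining, candidate.AsSpan(),
                                    CompareOptions.IgnoreCase, out matchLength);
    }
}
```

For the `b\u00e9d` versus `be\u0301d` example, a successful match would be expected to report the length of the matched region in the text (3) rather than `candidate.Length` (4), though the result depends on the runtime's globalization data.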

In reality, I strongly suspect this won't come up very often... but I'd really like to do the right thing here. I'd also really like to be able to do it without becoming a Unicode expert or implementing it myself :)

(Raised as bug 210 in Noda Time, in case anyone wants to follow any eventual conclusion.)

I like the idea of normalization. I need to check that in detail for a) correctness and b) performance. Assuming I can make it work correctly, I'm still not sure whether it would be worth changing at all - it's the sort of thing which will probably never actually come up in real life, but could hurt the performance of all my users :(

I've also checked the BCL - which doesn't appear to handle this properly either. Sample code:

using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        var culture = (CultureInfo) CultureInfo.InvariantCulture.Clone();
        var months = culture.DateTimeFormat.AbbreviatedMonthNames;
        months[10] = "be\u0301d";
        culture.DateTimeFormat.AbbreviatedMonthNames = months;

        var text = "25 b\u00e9d 2013";
        var pattern = "dd MMM yyyy";
        DateTime result;
        if (DateTime.TryParseExact(text, pattern, culture,
                                   DateTimeStyles.None, out result))
        {
            Console.WriteLine("Parsed! Result={0}", result);
        }
        else
        {
            Console.WriteLine("Didn't parse");
        }
    }
}

Changing the custom month name to just "bed" with a text value of "bEd" parses fine.

Okay, a few more data points:

  • The cost of using Substring and IsPrefix is significant but not horrible. On a sample of "Friday April 12 2013 20:28:42" on my development laptop, it changes the number of parse operations I can execute in a second from about 460K to about 400K. I'd rather avoid that slowdown if possible, but it's not too bad.

  • Normalization is less feasible than I thought - because it's not available in Portable Class Libraries. I could potentially use it just for non-PCL builds, allowing the PCL builds to be a little less correct. The performance hit of testing for normalization (string.IsNormalized) takes performance down to about 445K calls per second, which I can live with. I'm still not sure it does everything I need it to - for example, a month name containing "ß" should match "ss" in many cultures, I believe... and normalizing doesn't do that.
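To make the normalization point concrete, here is a small sketch using `string.Normalize` with `NormalizationForm.FormC`. `NormalizationDemo` and `EqualAfterFormC` are hypothetical names, and the code assumes a build where normalization is available (so not the PCL profile mentioned above).

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Returns true when the two strings are identical, char by char,
    // after both have been normalized to Form C.
    public static bool EqualAfterFormC(string a, string b)
    {
        string na = a.Normalize(NormalizationForm.FormC);
        string nb = b.Normalize(NormalizationForm.FormC);
        return string.Equals(na, nb, StringComparison.Ordinal);
    }
}
```

Form C composes `e` followed by U+0301 into U+00E9, so `EqualAfterFormC("b\u00e9d", "be\u0301d")` should hold. Normalization does not fold case or expand `ß`, so `EqualAfterFormC("ß", "ss")` should not, which is the gap described in the last bullet.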

Answer

I'll consider the problem of many<->one/many case mappings first, separately from handling different normalization forms.

For example:

x heiße y
  ^--- cursor

This matches heisse, but then moves the cursor one position too far. And:

x heisse y
  ^--- cursor

This matches heiße, but then moves the cursor one position short.

This will apply to any character that doesn't have a simple one-to-one mapping.

You would need to know the length of the substring that was actually matched, but Compare, IndexOf, etc. throw that information away. It might be possible with regular expressions, but the implementation doesn't do full case folding and so doesn't match ß to ss/SS in case-insensitive mode, even though .Compare and .IndexOf do. And it would probably be costly to create new regexes for every candidate anyway.

The simplest solution to this is to store strings internally in case-folded form and do binary comparisons with case-folded candidates. Then you can move the cursor correctly with just .Length, since the cursor indexes the internal representation. You also get most of the lost performance back from not having to use CompareOptions.IgnoreCase.
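A minimal sketch of that idea, assuming the caller has already case folded both the text and the candidates with whatever folding function is available. `FoldedCursor` and `TryConsume` are hypothetical names, not Noda Time code.

```csharp
using System;

public sealed class FoldedCursor
{
    private readonly string foldedText; // input, already case folded
    public int Position { get; private set; }

    public FoldedCursor(string foldedText)
    {
        if (foldedText == null) throw new ArgumentNullException("foldedText");
        this.foldedText = foldedText;
    }

    // `foldedCandidate` must have been folded the same way as the text.
    public bool TryConsume(string foldedCandidate)
    {
        int length = foldedCandidate.Length;
        if (Position + length > foldedText.Length)
        {
            return false; // not enough text left
        }
        // Ordinal (binary) comparison: no culture or case handling needed here.
        if (string.CompareOrdinal(foldedText, Position,
                                  foldedCandidate, 0, length) != 0)
        {
            return false;
        }
        Position += length; // safe: the cursor indexes the folded text
        return true;
    }
}
```

Because both sides are in the same folded form, the matched region always has exactly `foldedCandidate.Length` chars, which is what makes the cursor arithmetic trivial.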

Unfortunately there is no built-in case folding function, and the poor man's case folding doesn't work either, because there is no full case mapping: the ToUpper method doesn't turn ß into SS.
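As a quick illustration of that limitation, here is the upper-then-lower trick written in C#. This is only a sketch, and `PoorMansFold` is a hypothetical name.

```csharp
using System;

public static class PoorMansFold
{
    // Upper-case, then lower-case. .NET applies simple (one-to-one)
    // case mapping here, so 'ß' passes through unchanged.
    public static string Fold(string input)
    {
        if (input == null) throw new ArgumentNullException("input");
        return input.ToUpperInvariant().ToLowerInvariant();
    }
}
```

With this, `PoorMansFold.Fold("Heiße")` gives `"heiße"` rather than `"heisse"`, so the folded forms of heiße and heisse still differ under a binary comparison.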

For example, this works in Java (and even in JavaScript), given a string that is in Normal Form C:

//Poor man's case folding.
//There are some edge cases where this doesn't work
public static String toCaseFold( String input, Locale cultureInfo ) {
    return input.toUpperCase(cultureInfo).toLowerCase(cultureInfo);
}

It's fun to note that Java's ignore-case comparison doesn't do full case folding the way C#'s CompareOptions.IgnoreCase does. So they are opposites in this regard: Java does full case mapping but simple case folding, while C# does simple case mapping but full case folding.

So it's likely that you need a 3rd party library to case fold your strings before using them.


Before doing anything you have to be sure that your strings are in normal form C. You can use this preliminary quick check optimized for Latin script:

public static bool MaybeRequiresNormalizationToFormC(string input)
{
    if( input == null ) throw new ArgumentNullException("input");

    int len = input.Length;
    for (int i = 0; i < len; ++i)
    {
        if (input[i] > 0x2FF)
        {
            return true;
        }
    }

    return false;
}

This gives false positives but not false negatives. I don't expect it to slow down 460k parses/s at all when using Latin script characters, even though it needs to be performed on every string. On a positive result you would use IsNormalized to find out whether the string really needs normalizing, and only then normalize if necessary.
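The three steps (quick check, `IsNormalized`, `Normalize`) could be chained roughly like this. It is a sketch under the same assumptions as the quick check above; `FormCNormalizer` and `Ensure` are hypothetical names.

```csharp
using System;
using System.Text;

public static class FormCNormalizer
{
    // Quick pre-check from above: may report false positives,
    // but never false negatives.
    private static bool MaybeRequiresNormalization(string input)
    {
        for (int i = 0; i < input.Length; ++i)
        {
            if (input[i] > 0x2FF) return true;
        }
        return false;
    }

    // Returns `input` itself when it is already in Form C,
    // otherwise a normalized copy.
    public static string Ensure(string input)
    {
        if (input == null) throw new ArgumentNullException("input");
        if (!MaybeRequiresNormalization(input)) return input;      // fast path
        if (input.IsNormalized(NormalizationForm.FormC)) return input;
        return input.Normalize(NormalizationForm.FormC);           // slow path
    }
}
```

Strings that stay below U+0300 never reach `IsNormalized`, so typical Latin-script input only pays for the loop.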


So in conclusion, the processing is to ensure normal form C first, then case fold. Do binary comparisons with the processed strings and move the cursor as you do currently.
