Strange string.IndexOf behaviour

Question

I wrote the following snippet to get rid of excessive spaces in slabs of text:

int index = text.IndexOf("  ");
while (index > 0)
{
    text = text.Replace("  ", " ");
    index = text.IndexOf("  ");
}

Generally this works fine, albeit rather primitive and possibly inefficient.

When the text contains " - ", for some bizarre reason IndexOf returns a match! The Replace function doesn't remove anything, and the code is then stuck in an endless loop.

Any ideas what is going on with string.IndexOf?

Recommended answer

Ah, the joys of text.

What you most likely have there, but which got lost when posting on SO, is a "soft hyphen".

To reproduce the problem, I tried this code in LINQPad:

void Main()
{
    var text = "Test1 \u00ad Test2";
    int index = text.IndexOf("  ");
    while (index > 0)
    {
        text = text.Replace("  ", " ");
        index = text.IndexOf("  ");
    }
}

And sure enough, the above code just gets stuck in a loop.

Note that \u00ad is the Unicode code point for the soft hyphen, according to CharMap. You can always copy and paste the character from CharMap as well, but posting it here on SO will replace it with its much more common cousin, the hyphen-minus, U+002D (the one on your keyboard).

The documentation for the String class has a short passage on the subject:

String search methods, such as String.StartsWith and String.IndexOf, also can perform culture-sensitive or ordinal string comparisons. The following example illustrates the differences between ordinal and culture-sensitive comparisons using the IndexOf method. A culture-sensitive search in which the current culture is English (United States) considers the substring "oe" to match the ligature "œ". Because a soft hyphen (U+00AD) is a zero-width character, the search treats the soft hyphen as equivalent to Empty and finds a match at the beginning of the string. An ordinal search, on the other hand, does not find a match in either case.

The relevant part is the sentence about the soft hyphen. I also remember a blog post about this exact problem from a while back, but my Google-Fu is failing me tonight.

The problem here is that IndexOf and Replace use different methods for locating the text.

IndexOf, which does a culture-sensitive comparison by default, considers the soft hyphen as "not really there" and so treats the single spaces on either side of it as "two joined spaces". The Replace method does not, so it removes neither of them. The condition for the loop to keep iterating is therefore met, but since Replace doesn't remove the spaces that satisfy it, the loop never ends. Undoubtedly there are other characters in Unicode that exhibit similar problems, but this is the most typical case I've seen.
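The mismatch described above can be observed directly. The following is a minimal sketch rather than anything from the original answer: the class and method names are made up for illustration, and the value returned by the culture-sensitive search is deliberately not asserted because it depends on the runtime and its globalization settings.

```csharp
using System;

// Hypothetical demo class; the names are illustrative only.
public static class SoftHyphenDemo
{
    // IndexOf(string) with no comparison argument is culture-sensitive.
    public static int CultureIndex(string text)
    {
        return text.IndexOf("  ");
    }

    // An ordinal search compares the raw UTF-16 code units.
    public static int OrdinalIndex(string text)
    {
        return text.IndexOf("  ", StringComparison.Ordinal);
    }

    // Replace(string, string) is an ordinal operation, so this reports
    // whether two literally adjacent spaces exist in the text.
    public static bool ReplaceChanges(string text)
    {
        return text.Replace("  ", " ") != text;
    }

    public static void Main()
    {
        // A soft hyphen (U+00AD) sits between two single spaces.
        string text = "Test1 \u00ad Test2";

        Console.WriteLine("culture-sensitive IndexOf: " + CultureIndex(text));
        Console.WriteLine("ordinal IndexOf:           " + OrdinalIndex(text));
        Console.WriteLine("Replace changed the text:  " + ReplaceChanges(text));
    }
}
```

Whenever the first line prints a non-negative index while the last prints False, the loop from the question cannot terminate: the search keeps finding something that the replacement never touches.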

There are at least two ways of handling this:

  1. You can use Regex.Replace, which does not seem to have this problem:

text = Regex.Replace(text, "  +", " ");

Personally I would probably use the whitespace character class in the regular expression, which is \s, but if you only want to collapse spaces, the above should do the trick.

  2. You can explicitly ask IndexOf to use an ordinal comparison, which won't get tripped up by text behaving like ... well ... text:

index = text.IndexOf("  ", StringComparison.Ordinal);
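Put together, the two approaches might look like the sketch below. The class and method names are hypothetical, and two details are my own choices rather than part of the answer: the loop tests for `>= 0` so that a match at the very start of the string is also handled, and the regex variant uses `\s{2,}` so that a lone tab or newline is left alone.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class SpaceCollapser
{
    // Loop-based fix: the search is ordinal, just like Replace,
    // so both agree on what counts as "two spaces".
    public static string CollapseWithLoop(string text)
    {
        while (text.IndexOf("  ", StringComparison.Ordinal) >= 0)
        {
            text = text.Replace("  ", " ");
        }
        return text;
    }

    // Regex-based fix: replace every run of two or more whitespace
    // characters (spaces, tabs, line breaks) with a single space.
    public static string CollapseWithRegex(string text)
    {
        return Regex.Replace(text, @"\s{2,}", " ");
    }

    public static void Main()
    {
        // Three real spaces, then a soft hyphen between single spaces.
        string sample = "a   b \u00ad c";

        Console.WriteLine(CollapseWithLoop(sample));
        Console.WriteLine(CollapseWithRegex(sample));
    }
}
```

Neither version removes the soft hyphen itself or the single spaces around it; they only collapse spaces that are literally adjacent. Note that the `\s` variant also folds a run such as "\r\n" into one space, which may or may not be wanted.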
