正则表达式以令人惊讶的很长一段时间 [英] Regex taking surprisingly long time

查看:202
本文介绍了正则表达式以令人惊讶的很长一段时间的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个用户输入的搜索字符串。通常情况下,搜索字符串分割使用空格然后进行OR搜索(一个项目,如果它匹配任何搜索字符串的元素相匹配)。我想提供一些先进的查询功能,如用引号括住包含空格的文字短语的能力。



不过,我觉得我已经敲定了一个像样的正则表达式分裂的字符串对我来说,但它采取了令人惊讶的长的时间来执行(>2秒我的机器上)。我打破了它弄清楚流向何方打嗝是,更有趣的是,似乎最后的赛后发生匹配(想必,在输入的结尾) 。所有到字符串匹配的在更短的时间结束比赛的话,我可以捕获,但最后一场比赛(如果这就是它 - 没有返回)。需要的几乎所有的2秒



我希望有人可能有一些见解如何可以加快这个表达式了一下。我知道我使用的是后向采用无界量词,但就像我说的,这似乎并没有造成任何性能问题,直到最后一场比赛已经匹配了。



代码

 使用系统; 
使用System.Collections.Generic;
使用System.Linq的;
使用System.Text;使用System.Text.RegularExpressions
;

命名空间RegexSandboxCSharp {
类节目{
静态无效的主要(字串[] args){

串l_input1 =#1 \两three\四之五:\六seven\八\九ten\;

串l_pattern =
@(小于?= ^([^?] *([] [^] * [,]))*) \s +;

正则表达式l_regex ​​=新的正则表达式(l_pattern);

MatchCollection l_matches = l_regex.Matches(l_input1);
System.Collections.IEnumerator l_matchEnumerator = l_matches.GetEnumerator();

的DateTime l_listStart = DateTime.Now;
名单,LT;字符串> l_elements =新的List<串GT;();
INT l_previousIndex = 0;
INT l_previousLength = 0;
//最后的MoveNext(),它返回false,需要2秒。
,而(l_matchEnumerator.MoveNext()){
匹配l_match =(匹配)l_matchEnumerator.Current;
INT l_start = l_previousIndex + l_previousLength;
INT l_length = l_match.Index - l_start;
l_elements.Add(l_input1.Substring(l_start,l_length));

l_previousIndex = l_match.Index;
l_previousLength = l_match.Length;
}
Console.WriteLine(列出的组合物的时间:+(DateTime.Now - l_listStart).TotalMilliseconds.ToString());

的String [] = l_terms l_elements.ToArray();

Console.WriteLine(的string.join(\\\
,l_terms));

Console.ReadKey(真);

}
}
}



输出结果
(这正是我得到。)







一个结果
二三结果
4结果
五:六七结果
8结果
十人九




解决方案

试着改变你的正则表达式如下:

 (小于= ^( (大于?[^] *)(?[] [^] * []))*)\s + 

这里唯一的变化是将 [^] * atomic组,这将阻止灾难性回溯的发生。



注意:上面的正则表达式是显然不使用C#正则表达式字符串语法,这我很熟悉,但我认为这将是以下内容:

  @?(小于= ^((> [^] *)([ ] [^] * []))*)\s +?; 



为什么灾难发生回溯:结果
一旦所有有效匹配已经发现,尝试下一个匹配是最终引部的内部的空间。因为有一个奇数空间之前行情的回顾后会失败。



在这一点上的后向内的正则表达式将开始原路返回。锚意味着它将总是开始在字符串的开头,但它仍然可以通过从它已经匹配了端脱落元件原路返回。让我们看看正则表达式的后向内侧:

  ^([^] *([] [^] * [])?)* 

由于引述部分都是可选的,它们可以被丢弃的正则表达式回溯。对于非引号字符是不带引号的段内的每个块,前回溯每个字符将被匹配为在的 [^] * 的一部分开始正则表达式。至于回溯在该节的最后一个字符将被丢弃开始什么 [^] * 匹配,并将由外重复领取。在这一点上就变得非常类似于在上面的灾难性回溯链路的例子


I have a search string entered by a user. Normally, the search string is split up using whitespace and then an OR search is performed (an item matches if it matches any of the search string elements). I want to provide a few "advanced" query features, such as the ability to use quotes to enclose literal phrases containing whitespace.

I though I had hammered out a decent regex to split up the strings for me, but it's taking a surprisingly long time to execute (> 2 seconds on my machine). I broke it out to figure out just where the hiccup was, and even more interestingly it seems to occur after the last Match is matched (presumably, at the end of the input). All of the matches up to the end of the string match in less time then I can capture, but that last match (if that's what it is - nothing returns) takes almost all of the 2 seconds.

I was hoping someone might have some insight into how I can speed this regex up a bit. I know I'm using a lookbehind with an unbounded quantifier but, like I said, this doesn't seem to cause any performance issues until after the last match has been matched.

CODE

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace RegexSandboxCSharp {
    class Program {
        static void Main( string[] args ) {

            string l_input1 = "# one  \"two three\" four five:\"six seven\"  eight \"nine ten\"";

            string l_pattern =
                @"(?<=^([^""]*([""][^""]*[""])?)*)\s+";

            Regex l_regex = new Regex( l_pattern );

            MatchCollection l_matches = l_regex.Matches( l_input1 );
            System.Collections.IEnumerator l_matchEnumerator = l_matches.GetEnumerator();

            DateTime l_listStart = DateTime.Now;
            List<string> l_elements = new List<string>();
            int l_previousIndex = 0;
            int l_previousLength = 0;
            //      The final MoveNext(), which returns false, takes 2 seconds.
            while ( l_matchEnumerator.MoveNext() ) {
                Match l_match = (Match) l_matchEnumerator.Current;
                int l_start = l_previousIndex + l_previousLength;
                int l_length = l_match.Index - l_start;
                l_elements.Add( l_input1.Substring( l_start, l_length ) );

                l_previousIndex = l_match.Index;
                l_previousLength = l_match.Length;
            }
            Console.WriteLine( "List Composition Time: " + ( DateTime.Now - l_listStart ).TotalMilliseconds.ToString() );

            string[] l_terms = l_elements.ToArray();

            Console.WriteLine( String.Join( "\n", l_terms ) );

            Console.ReadKey( true );

        }
    }
}

OUTPUT
(This is exactly what I'm getting.)

one
"two three"
four
five:"six seven"
eight
"nine ten"

解决方案

Try changing your regex to the following:

(?<=^((?>[^"]*)(["][^"]*["])?)*)\s+

The only change here is to put the [^"]* into an atomic group, which prevents the catastrophic backtracking that occurs.

Note: The regex above is obviously does not use C# regex string syntax, which I am unfamiliar with, but I think it would be the following:

@"(?<=^((?>[^""]*)([""][^""]*[""])?)*)\s+";

Why the catastrophic backtracking occurs:
Once all of the valid matches have been found the next match that is attempted is the space inside of the final quoted section. The lookbehind will fail because there are an odd number of quotes before the space.

At this point the regex inside of the lookbehind will start to backtrack. The anchor means it will always start at the beginning of the string, but it can still backtrack by dropping elements from the end of what it has matched. Lets look at the regex inside of the lookbehind:

^([^"]*(["][^"]*["])?)*

Since the quoted sections are optional, they can be dropped as the regex backtracks. For every chunk of non-quote characters that are not inside of a quoted section, before backtracking each character would have been matched as a part of the [^"]* at the beginning of the regex. As backtracking begins on that section the last character will be dropped from what the [^"]* matched, and will be picked up by the outer repetition. At this point it becomes very similar to the example in the catastrophic backtracking link above.

这篇关于正则表达式以令人惊讶的很长一段时间的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆