How to avoid infinite loops in the .NET RegEx class?


Question

I have a simple task: take an XPath expression and return the prefix that matches the parent of the node that would (potentially) be selected.

For example:

/aaa/bbb       =>   /aaa
/aaa/bbb/ccc   =>   /aaa/bbb
/aaa/bbb/ccc[@x='1' and @y="/aaa[name='z']"] => /aaa/bbb

Because the predicate inside the square brackets might itself contain brackets within quotes, I decided to try to do this with regular expressions. Here's a code snippet:

string input =
    "/aaa/bbb/ccc[@x='1' and @y=\"/aaa[name='z'] \"]";
                                            //  ^-- remove space for no loop
string pattern = @"/[a-zA-Z0-9]+(\[([^]]*(]"")?)+])?$";

System.Text.RegularExpressions.Regex re =
    new System.Text.RegularExpressions.Regex(pattern);
bool ismatch = re.IsMatch(input); // <== Infinite loop in here
// some code based on the match

Because the expressions are fairly regular, I look for '/' followed by an identifier, followed by an optional group, anchored at the end of the string: (....)?$

The code seemed to work, but while playing with different values for the input string I found that simply inserting a space (at the location shown in the comment) makes the .NET IsMatch function go into an infinite loop, taking all the CPU it can get.

Now, regardless of whether this regular expression pattern is the best one (I had a more complex one but simplified it to show the problem), this seems to show that using RegEx for anything non-trivial may be very risky.

Am I missing something? Is there a way to guard against infinite loops in regular expression matches?

Recommended answer

OK, let's break this down:

Input: /aaa/bbb/ccc[@x='1' and @y="/aaa[name='z'] "]
Pattern: /[a-zA-Z0-9]+(\[([^]]*(]")?)+])?$

(I'm reading the "" in your C# string as a single " here; inside a verbatim @"..." string that is how a double quote is escaped, so the regex engine sees the pattern as written above.)

First, /[a-zA-Z0-9]+ gobbles everything up to the first square bracket, leaving:

Input: [@x='1' and @y="/aaa[name='z'] "]

The outer group in (\[([^]]*(]")?)+])?$ should match if there is 0 or 1 instance of it before the end of the string. So let's look inside and see whether it matches anything.

The "[" gets gobbled right away, leaving us with:

Input: @x='1' and @y="/aaa[name='z'] "]
Pattern: ([^]]*(]")?)+]

Breaking down the pattern: match 0 or more non-] characters, then match ]" 0 or 1 times, and keep doing this until you can't. Then try to find and gobble a ] afterward.

The pattern matches on [^]]* until it reaches the first ].

Since there's a space between ] and ", (]") can't gobble either of those characters, but the ? after it lets the group succeed anyway by matching nothing.

Now we've successfully matched ([^]]*(]")?) once, but the + says we should keep trying to match it as many times as we can.

This leaves us with:

Input: ] "]

The problem here is that ([^]]*(]")?) can match at this point any number of times without ever gobbling a character, and the "+" will force the engine to keep trying.

You're essentially matching "1 or more" repetitions of a group that can match "0 or more" of something followed by "0 or 1" of something else. Since neither of the two subpatterns can consume anything from the remaining input, the engine keeps matching zero of [^]]* and zero of (]")? over and over.
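One way to take the retrying out of that inner loop is an atomic group, (?>...), which the .NET engine supports: once the [^]]* run has matched, the engine may not hand characters back to it and re-split them across iterations of the "+". The following is only a sketch of that idea and is not from the original post; the class name AtomicPatternDemo and its members are made up for illustration. Note that it still rejects the input with the extra space (the pattern has no way to get past `] "`), it should just do so quickly.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtomicPatternDemo
{
    // Same shape as the pattern in the question, but the inner [^\]]* run is
    // wrapped in an atomic group (?>...), so it cannot be re-split on failure.
    // Inside the verbatim string, "" stands for a single " in the regex.
    public static readonly Regex LastStep = new Regex(
        @"/[a-zA-Z0-9]+(\[((?>[^\]]*)(\]"")?)+\])?$");

    // Hypothetical convenience wrapper around Regex.IsMatch.
    public static bool IsMatch(string input)
    {
        return LastStep.IsMatch(input);
    }
}
```

With this sketch, AtomicPatternDemo.IsMatch returns true for the input without the space and false for the input with it, which is the same answer the original pattern is heading toward, minus the wait.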

The remaining input never gets gobbled, so the part of the pattern after the "+" never gets to succeed and the match does not finish in any reasonable time.
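If the goal is to actually accept predicates that contain brackets inside quotes, one option is to describe the predicate so that no character can be matched in two different ways: either a single character that is not a quote or ], or a complete quoted string. The sketch below assumes that reading of the task and is not the asker's code; XPathParent and TryGetParentPrefix are hypothetical names, and it only handles one trailing predicate with simple '...' and "..." strings.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XPathParent
{
    // Last step: '/', a name, then an optional [predicate] at the end.
    // Inside the predicate each alternative starts with a different kind of
    // character, so there is only one way to consume any given text.
    // Inside the verbatim string, "" stands for a single " in the regex.
    private static readonly Regex LastStep = new Regex(
        @"/[a-zA-Z0-9]+(\[(?:[^\]""']|""[^""]*""|'[^']*')*\])?$");

    // Returns true and the text before the last step, or false if the
    // expression does not end in a step this sketch understands.
    public static bool TryGetParentPrefix(string xpath, out string prefix)
    {
        Match m = LastStep.Match(xpath);
        prefix = m.Success ? xpath.Substring(0, m.Index) : string.Empty;
        return m.Success;
    }
}
```

For the three examples in the question, including the one with the space before the closing quote, this yields /aaa, /aaa/bbb and /aaa/bbb respectively.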

(Hopefully I got the SO-escape-of-regex-escape right above.)
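As for guarding against a runaway match in general: since .NET Framework 4.5 the Regex class accepts a match timeout, and the engine throws RegexMatchTimeoutException when a match runs past it. A minimal sketch, assuming that is available on your runtime; RegexGuard and TryIsMatch are hypothetical names, not part of the framework:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexGuard
{
    // Hypothetical helper: returns true/false when the match finishes in
    // time, and null when the engine gives up after the timeout.
    public static bool? TryIsMatch(string input, string pattern, TimeSpan timeout)
    {
        try
        {
            // Static overload that takes a per-call match timeout.
            return Regex.IsMatch(input, pattern, RegexOptions.None, timeout);
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }
}
```

Called with the pattern from the question and a timeout of, say, one second, this should come back with either a result or null instead of pinning the CPU. A timeout only limits the damage, though; fixing the pattern is still the better cure.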
