正则表达式模式灾难性回溯 [英] Regex Pattern Catastrophic backtracking

查看:340
本文介绍了正则表达式模式灾难性回溯的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的旧Java系统中使用了下面显示的正则表达式,最近导致了回溯问题。
通常回溯线程会导致机器的CPU达到上限,并且在重新启动应用程序之前它不会返回。

I have the regex shown below used in one of my old Java systems which is causing backtracking issues lately. Quite often the backtracking threads cause the CPU of the machine to hit the upper limit and it does not return back until the application is restarted.

任何人都可以建议更好的方法来重写这种模式或者一种可以帮助我这样做的工具吗?

Could any one suggest a better way to rewrite this pattern or a tool which would help me to do so?

模式:

^\[(([\p{N}]*\]\,\[[\p{N}]*)*|[\p{N}]*)\]$

工作价值:

[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]

灾难性的回溯值 - 如果值中有任何错误(最后添加额外的括号):

Catastrophic backtracking values — if anything is wrong in the values (an extra brace added at the end):

[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]]


推荐答案

+ 时,切勿使用 * 你是什​​么意思。我注意到你的正则表达式的第一件事是几乎所有东西都是可选的。只需要打开和关闭方括号,我很确定你不想将 [] 视为有效输入。

Never use * when + is what you mean. The first thing I noticed about your regex is that almost everything is optional. Only the opening and closing square brackets are required, and I'm pretty sure you don't want to treat [] as a valid input.

失控回溯的最大原因之一是有两个或更多可以匹配相同内容的替代品。这就是你用 | [\\\ {N}] * 部分所得到的。正则表达式引擎必须在它放弃之前尝试通过字符串的每个可想到的路径,所以所有那些 \p {N} * 构造进入无休止的拔河比赛每组数字。

One of the biggest causes of runaway backtracking is to have two or more alternatives that can match the same things. That's what you've got with the |[\p{N}]* part. The regex engine has to try every conceivable path through the string before it gives up, so all those \p{N}* constructs get into an endless tug-of-war over every group of digits.

但是没有必要尝试修复这些问题,因为整体结构是错误的。我想这就是你要找的东西:

But there's no point trying to fix those problems, because the overall structure is wrong. I think this is what you're looking for:

^\[\p{N}+\](?:,\[\p{N}+\])*$

之后使用第一个令牌( [1234567] ),如果字符串中的下一个不是逗号或字符串的结尾,则会立即失败。如果 看到一个逗号,它必须继续匹配另一个完整的令牌( [89023432] ),否则它会立即失败。

After it consumes the first token ([1234567]), if the next thing in the string is not a comma or the end of the string, it fails immediately. If it does see a comma, it must go on to match another complete token ([89023432]), or it fails immediately.

当你创建一个正则表达式时,这可能是最重要的事情:如果它会失败,你希望它尽快失败。您可以使用原子组和占有量词等功能,但如果正确使用正则表达式的结构,则很少需要它们。回溯并非不可避免。

That's probably the most important thing to remember when you're creating a regex: if it's going to fail, you want it to fail as quickly as possible. You can use features like atomic groups and possessive quantifiers toward that end, but if you get the structure of the regex right, you rarely need them. Backtracking is not inevitable.

这篇关于正则表达式模式灾难性回溯的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆