如何识别邪恶的正则表达式? [英] How can I recognize an evil regex?

查看:32
本文介绍了如何识别邪恶的正则表达式?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我最近发现了正则表达式拒绝服务攻击,并决定根除-在我的代码库中可以找到它们的任何地方都称为邪恶"正则表达式模式 - 或者至少是那些用于用户输入的模式.上面的OWASP 链接维基百科 很有帮助,但它们并没有很好地用简单的术语解释问题.

I recently became aware of Regular expression Denial of Service attacks, and decided to root out so-called 'evil' regex patterns wherever I could find them in my codebase - or at least those that are used on user input. The examples given at the OWASP link above and wikipedia are helpful, but they don't do a great job of explaining the problem in simple terms.

对邪恶正则表达式的描述,来自维基百科:

A description of evil regexes, from wikipedia:

  • 正则表达式将重复(+"、*")应用于复杂的子表达式;
  • 对于重复的子表达式,存在一个匹配项,该匹配项也是另一个有效匹配项的后缀.

举个例子,同样来自维基百科:

  • (a+)+
  • ([a-zA-Z]+)*
  • (a|aa)+
  • (a|a?)+
  • (.*a){x} for x > 10
  • (a+)+
  • ([a-zA-Z]+)*
  • (a|aa)+
  • (a|a?)+
  • (.*a){x} for x > 10

这是一个没有更简单解释的问题吗?我正在寻找可以在编写正则表达式时更轻松地避免此问题的东西,或者可以在现有代码库中找到它们.

Is this a problem that just doesn't have a simpler explanation? I'm looking for something that would make it easier to avoid this problem while writing regexes, or to find them within an existing codebase.

推荐答案

为什么邪恶的正则表达式是个问题?

因为计算机完全按照您的要求执行,即使这不是您的意思或完全不合理.如果您要求正则表达式引擎证明,对于某些给定的输入,给定模式匹配或不匹配,那么无论必须测试多少种不同的组合,引擎都会尝试这样做.

Why Are Evil Regexes A Problem?

Because computers do exactly what you tell them to do, even if it's not what you meant or is totally unreasonable. If you ask a Regex engine to prove that, for some given input, there either is or is not a match for a given pattern, then the engine will attempt to do that no matter how many different combinations must be tested.

这是一个受 OP 帖子中第一个示例启发的简单模式:

Here is a simple pattern inspired by the first example in the OP's post:

^((ab)*)+$

给定输入:

abababababababababababab

abababababababababababab

正则表达式引擎尝试类似 (abababababababababababab) 的内容,并且在第一次尝试时就找到了匹配项.

The regex engine tries something like (abababababababababababab) and a match is found on the first try.

然后我们把活动扳手扔进去:

But then we throw the monkey wrench in:

abababababababababababab a

abababababababababababab a

引擎将首先尝试 (ababababababababababab) 但由于额外的 a 而失败.这会导致灾难性的跟踪,因为我们的模式 (ab)* 出于善意,将释放其中一个捕获(它将回溯")并让外部模式再次尝试.对于我们的正则表达式引擎,它看起来像这样:

The engine will first try (abababababababababababab) but that fails because of that extra a. This causes catastrophic bracktracking, because our pattern (ab)*, in a show of good faith, will release one of it's captures (it will "backtrack") and let the outer pattern try again. For our regex engine, that looks something like this:

(abababababababababababab) - 没有
(abababababababababab)(ab) - 没有
(ababababababababab)(abab) - 没有
(abababababababababab)(ab)(ab) - 没有
(abababababababab)(ababab) - 没有
(abababababababab)(abab)(ab) - 没有
(abababababababab)(ab)(abab) - 没有
(abababababababab)(ab)(ab)(ab) - 没有
(abababababababab)(abababab) - 没有
(abababababababab)(ababab)(ab) - 没有
(abababababababab)(abab)(abab) - 没有
(abababababababab)(abab)(ab)(ab) - 没有
(abababababababab)(ab)(ababab) - 没有
(abababababababab)(ab)(abab)(ab) - 没有
(abababababababab)(ab)(ab)(abab) - 没有
(abababababababab)(ab)(ab)(ab)(ab) - 没有
(ababababababab)(ababababab) - 没有
(ababababababab)(abababab)(ab) - 没有
(ababababababab)(ababab)(abab) - 没有
(abababababab)(ababab)(ab)(ab) - 不
(ababababababab)(abab)(abab)(ab) - 没有
(ababababababab)(abab)(ab)(abab) - 没有
(ababababababab)(abab)(ab)(ab)(ab) - 没有
(abababababab)(ab)(abababab) - 没有
(abababababab)(ab)(ababab)(ab) - 不
(ababababababab)(ab)(abab)(abab) - 没有
(ababababababab)(ab)(abab)(ab)(ab) - 没有
(abababababab)(ab)(ab)(ababab) - 不
(ababababababab)(ab)(ab)(abab)(ab) - 没有
(ababababababab)(ab)(ab)(ab)(abab) - 没有
(ababababababab)(ab)(ab)(ab)(ab)(ab) - 没有
...
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abababab) - 不
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab)(ab) - 不
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(abab) - 不
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab)(ab) - 不
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab) - 不
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab) - 不
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab) - 不
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab) - 没有

(abababababababababababab) - Nope
(ababababababababababab)(ab) - Nope
(abababababababababab)(abab) - Nope
(abababababababababab)(ab)(ab) - Nope
(ababababababababab)(ababab) - Nope
(ababababababababab)(abab)(ab) - Nope
(ababababababababab)(ab)(abab) - Nope
(ababababababababab)(ab)(ab)(ab) - Nope
(abababababababab)(abababab) - Nope
(abababababababab)(ababab)(ab) - Nope
(abababababababab)(abab)(abab) - Nope
(abababababababab)(abab)(ab)(ab) - Nope
(abababababababab)(ab)(ababab) - Nope
(abababababababab)(ab)(abab)(ab) - Nope
(abababababababab)(ab)(ab)(abab) - Nope
(abababababababab)(ab)(ab)(ab)(ab) - Nope
(ababababababab)(ababababab) - Nope
(ababababababab)(abababab)(ab) - Nope
(ababababababab)(ababab)(abab) - Nope
(ababababababab)(ababab)(ab)(ab) - Nope
(ababababababab)(abab)(abab)(ab) - Nope
(ababababababab)(abab)(ab)(abab) - Nope
(ababababababab)(abab)(ab)(ab)(ab) - Nope
(ababababababab)(ab)(abababab) - Nope
(ababababababab)(ab)(ababab)(ab) - Nope
(ababababababab)(ab)(abab)(abab) - Nope
(ababababababab)(ab)(abab)(ab)(ab) - Nope
(ababababababab)(ab)(ab)(ababab) - Nope
(ababababababab)(ab)(ab)(abab)(ab) - Nope
(ababababababab)(ab)(ab)(ab)(abab) - Nope
(ababababababab)(ab)(ab)(ab)(ab)(ab) - Nope
                              ...
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab) - Nope

可能组合的数量随着输入的长度呈指数级增长,在您知道之前,正则表达式引擎正在耗尽您所有的系统资源试图解决这个问题,直到用尽所有可能的术语组合,最终放弃并报告没有匹配项".与此同时,你的服务器变成了一堆熔化的金属.

The number of possible combinations scales exponentially with the length of the input and, before you know it, the regex engine is eating up all your system resources trying to solve this thing until, having exhausted every possible combination of terms, it finally gives up and reports "There is no match." Meanwhile your server has turned into a burning pile of molten metal.

这实际上非常棘手.我自己写了一些,尽管我知道它们是什么以及通常如何避免它们.请参阅正则表达式花费了惊人的时间.在原子组中包装所有你能做的事情有助于防止回溯问题.它基本上告诉正则表达式引擎不要重新访问给定的表达式 - 锁定您在第一次尝试时匹配的任何内容".但是请注意,原子表达式并不能阻止在表达式中回溯 ,因此 ^(?>((ab)*)+)$ 仍然很危险,但是^(?>(ab)*)+$ 是安全的(它会匹配 (ababababababababababab) 然后拒绝放弃任何匹配的字符,从而防止灾难性的回溯).

It's actually very tricky. I have written a couple myself, even though I know what they are and generally how to avoid them. See Regex taking surprisingly long time. Wrapping everything you can in an atomic group can help to prevent the backtracking issue. It basically tells the regex engine not to revisit a given expression - "lock whatever you matched on the first try". Note, however, that atomic expressions don't prevent backtracking within the expression, so ^(?>((ab)*)+)$ is still dangerous, but ^(?>(ab)*)+$ is safe (it'll match (abababababababababababab) and then refuse to give up any of it's matched characters, thus preventing catastrophic backtracking).

不幸的是,一旦写好,实际上很难立即或快速找到有问题的正则表达式.最后,识别错误的正则表达式就像识别任何其他错误代码 - 这需要大量的时间和经验和/或单一的灾难性事件.

Unfortunately, once it's written, it's actually very hard to immediately or quickly find a problem regex. In the end, recognizing a bad regex is like recognizing any other bad code - it takes a lot of time and experience and/or a single catastrophic event.

有趣的是,自从第一次写出这个答案后,德克萨斯大学奥斯汀分校的一个团队发表了一篇论文,描述了一种能够对正则表达式进行静态分析的工具的开发,其明确目的是找到这些邪恶"的正则表达式.模式.该工具是为分析 Java 程序而开发的,但我怀疑在未来几年,我们会看到更多围绕分析和检测 JavaScript 和其他语言中的问题模式而开发的工具,尤其是 ReDoS 攻击率继续攀升.

Interestingly, since this answer was first written, a team at the University of Texas at Austin published a paper describing the development of a tool capable of performing static analysis of Regular Expressions with the express purpose of finding these "evil" patterns. The tool was developed to analyse Java programs, but I suspect that in the coming years we'll see more tools developed around anaylsing and detecting problematic patterns in JavaScript and other languages, especially as the rate of ReDoS attacks continues to climb.

DoS 漏洞的静态检测使用正则表达式的程序
Valentin Wüstholz、Oswaldo Olivo、Marijn J. H. Heule 和 Isil Dillig
德克萨斯大学奥斯汀分校

Static Detection of DoS Vulnerabilities in Programs that use Regular Expressions
Valentin Wüstholz, Oswaldo Olivo, Marijn J. H. Heule, and Isil Dillig
The University of Texas at Austin

这篇关于如何识别邪恶的正则表达式?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆