为什么这个正则表达式会杀死Java正则表达式引擎? [英] Why does this regular expression kill the Java regex engine?

查看:127
本文介绍了为什么这个正则表达式会杀死Java正则表达式引擎?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有这个天真的正则表达式<([\]] | [^<])+?>(不包括引号)。看起来很简单,但是当它与下面的HTML文本相反时它确实是邪恶的。它将Java正则表达式引擎发送到无限循环。

I have this naive regex "<([\s]|[^<])+?>" (excluding the quotation marks). It seems so straightforward but it is indeed evil when it works against the below HTML text. It sends the Java regular expression engine to an infinite loop.

我有另一个正则表达式(<。+?>),这有点相同,但它不会杀死任何东西。你知道为什么会这样吗?

I have another regex ("<.+?>"), which does somewhat the same thing, but it doesn't kill anything. Do you know why this happens?

<script language="JavaScript" type="text/javascript">
        var numDivs, layerName;
        layerName = "lnavLayer";
        catLinkName = "category";
        numDivs = 2;
        function toggleLayer(layerID){
            if (!(navigator.appName == "Netscape" && navigator.appVersion.substr(0, 1) < 5)){
                thisLayer = document.getElementById(layerName + layerID);
                categoryLink = document.getElementById(catLinkName + layerID);
                closeThem();
                if (thisLayer.className == 'subnavDefault'){
                    thisLayer.className = 'subnavToggled';
                    categoryLink.className = 'leftnavLinkSelectedSection';
                }
            }
        }
        function closeThem(){
            for(x = 0; x < numDivs; x++){
                theLayer = document.getElementById(layerName + (x
+ 1));
                thecategoryLink = document.getElementById(catLinkName + (x + 1));
                theLayer.className = 'subnavDefault';
                thecategoryLink.className = 'leftnavLink';
            }
        } var flag = 0; var lastClicked = 0
    //-->
    </script>

它甚至可以使用在线Java正则表达式工具循环(例如 www.fileformat.info/tool/regex.htm )或像 RegexBuddy

it even keeps looping with an online Java regex tool (such as www.fileformat.info/tool/regex.htm) or a utility like RegexBuddy.

推荐答案

Java正则表达式引擎崩溃的原因是正则表达式的这一部分导致堆栈溢出(确实!):

The reason the Java regex engine crashes is that this part of your regex causes a stack overflow (indeed!):

[\s]|[^<]

这里发生的是,与\s匹配的每个字符也可以通过[^< ]。这意味着有两种方法可以匹配每个空白字符。如果我们用A和B代表两个字符类:

What happens here is that every character matched by \s can also be matched by [^<]. That means there are two ways to match each whitespace character. If we represent the two character classes with A and B:

A|B

然后可以将三个空格的字符串匹配为AAA,AAB,ABA,ABB,BAA,BAB,BBA或BBB。换句话说,这部分正则表达式的复杂性是2 ^ N.这会杀死任何对我所谓的没有任何保护措施的正则表达式引擎灾难性的回溯

Then a string of three spaces could be matched as AAA, AAB, ABA, ABB, BAA, BAB, BBA, or BBB. In other words the complexity of this part of the regex is 2^N. This will kill any regex engine that doesn't have any safeguards against what I call catastrophic backtracking.

在正则表达式中使用交替(垂直条)时,请务必确保备选方案是互斥的。也就是说,最多可以允许其中一个替代方案匹配任何给定的文本位。

When using alternation (vertical bar) in a regex, always make sure the alternatives are mutually exclusive. That is, at most one of the alternatives may be allowed to match any given bit of text.

这篇关于为什么这个正则表达式会杀死Java正则表达式引擎?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆