找出两个Glob模式(或正则表达式)的匹配项是否相交的算法 [英] Algorithm to find out whether the matches for two Glob patterns (or Regular Expressions) intersect

查看:79
本文介绍了找出两个Glob模式(或正则表达式)的匹配项是否相交的算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找与 Redis KEYS命令接受的相匹配的全局样式。报价:

I'm looking at matching glob-style patterns similar the what the Redis KEYS command accepts. Quoting:



  • h?llo匹配hello,hello和hxllo

  • h * llo匹配hllo和heeeello

  • h [ae] llo匹配hello和hello,但不匹配hillo

但是我不匹配文本字符串,而是将模式与另一个模式匹配,所有运算符在两端都是有意义的。

But I am not matching against a text string, but matching the pattern against another pattern with all operators being meaningful on both ends.

例如,这些模式应该在同一行中彼此匹配:

For example these patterns should match against each other in the same row:

prefix*       prefix:extended*
*suffix       *:extended:suffix
left*right    left*middle*right
a*b*c         a*b*d*b*c
hello*        *ok
pre[ab]fix*   pre[bc]fix*

这些不应该匹配:

prefix*       wrong:prefix:*
*suffix       *suffix:wrong
left*right    right*middle*left
pre[ab]fix*   pre[xy]fix*
?*b*?         bcb

所以我想知道...


  • 如果可能的话(实施验证算法),如果有的话?

  • 如果不可能的话,正则表达式的哪些子集是可能的? (即禁止*通配符?)

  • 如果确实可行,什么是有效的算法?

  • 所需的时间复杂度是多少?
  • if this is possible to do (implement a verification algorithm), if at all?
  • if not possible, what subset of regex would be possible? (i.e. disallow * wildcard?)
  • if it is indeed possible, what is an efficient algorithm?
  • what are the time complexity required?

编辑:找到有关RegEx子集的另一个问题,但这与 hello * * ok 匹配的单词不是子集/

Found this other question on RegEx subset but this is not exactly the same as the words that hello* and *ok matches is not a subset/superset of each other but they do intersect.

所以我从数学上讲,这可能表述为;是否有可能确定性地检查一个模式匹配的单词集与另一个模式匹配的单词集相交是否导致非空集?

So I guess mathematically, this might be phrased as; is it possible to deterministically check that a set of words that one pattern match, intersecting with a set of words that another pattern matches, result in a non-empty set?

编辑::一个朋友 @neizod 消除表可以清晰地可视化可能的解决方案/部分解决方案:消除规则

A friend @neizod drew up this elimination table which neatly visualize what might be a potential/partial solution: Elimination rule

编辑:将为那些还可以提供工作代码(任何语言)和证明用例的人增加额外的赏金

Will adds extra bounty for those who can also provide working code (in any language) and test cases that proves it.

编辑:添加了?* b *?由@DanielGimenez在评论中发现的测试用例。

Added the ?*b*? test case discovered by @DanielGimenez in the comments.

推荐答案

现在见证这个全副武装的火力作战战斗台!

Now witness the firepower of this fully ARMED and OPERATIONAL battle station!

(我在这个答案上做得太多了,我的大脑坏了;应该有一个徽章

为了确定两个模式是否相交,我创建了一个递归回溯解析器-当 Kleene stars 会遇到一个新的堆栈,因此如果将来失败,则所有内容都会回滚,并且 star 会占用下一个字符。

In order to determine if two patterns intersect, I have created a recursive backtracking parser -- when Kleene stars are encountered a new stack is created so that if it fails in the future everything is rolled back and and the star consumes the next character.

您可以查看此答案的历史,以确定如何得出所有答案以及为什么这样做的必要性,但基本上,仅靠向前看并不足以确定交叉点

You can view the history of this answer to determine how arrived at all this and why it was necessary, but basically it wasn't sufficient to determine an intersection by looking ahead only one token, which was what I was doing before.

这种情况打破了旧答案 [abcd] d => * d 设置 之后的 d 匹配,因此左侧仍将保留令牌,而右侧将是完整的。但是,这两个模式在 ad bd cd dd ,因此需要解决。我的几乎 O(N)答案被抛出。

This was the case that broke the old answer [abcd]d => *d. The set matches the d after the star, so the left side would still have tokens remaining, while the right side would be complete. However, these patterns two intersect on ad, bd, cd and dd, so that needed to be fixed. My almost O(N) answer was thrown out.

词法化过程很简单,除了过程会转义字符并删除多余的 。令牌分为 sets stars 通配符(?) 字符 。这与我以前的版本不同,在我以前的版本中,一个令牌是一个字符串而不是一个字符。随着更多案例的出现,使用字符串作为令牌更多的是障碍而不是优势。

The lexing process is trivial, except that is processes escape characters and removes redundant stars. Tokens are broken out into sets, stars, wild character (?), and character. This is different than my previous versions where one token was a string of characters instead of a single character. As more cases come up, having strings as tokens was more of a hindrance than advantage.

大多数解析器的功能非常简单。给定左侧类型的开关调用一个函数,该函数确定适当的功能以将其与右侧类型进行比较。比较的结果使两个开关冒充到原始被调用者,通常是解析器的主循环。

Most of the functions of the parser are pretty trivial. A switch given the left side's type, calls a function that is a switch that determines the appropriate function to compare it with the right side's type. The result from the comparison bubbles up the two switches to the original callee, typically the main loop of the parser.

简单性以 star 结尾。遇到这种情况时,它将接管一切。首先,它将对方的下一个令牌与对方的下一个令牌进行比较,将对方推进,直到找到匹配项为止。

The simplicity ends with the star. When that is encountered it takes over everything. First it compares its side's next token with the other side's, advancing the other side until if finds a match.

找到匹配项后,它会检查所有匹配项是否都匹配直到两个模式的末尾。如果是这样,则图案相交。否则,它会从与之比较的原始令牌中推进对方的下一个令牌,并重复该过程。

Once the match is found, it then checks if everything matches all the way up to the end of both patterns. If it does then the patterns intersect. Otherwise, it advances the other side's next token from the original one it was compared against and repeats the process.

当两个 anys ,然后从彼此的下一个令牌开始进入自己的替代分支。

When two anys are encountered then the take off into their own alternative branches starting from each others' next token.

function intersects(left, right) {
    var lt, rt,
        result = new CompareResult(null, null, true);

    lt = (!left || left instanceof Token) ? left : tokenize(left);
    rt = (!right || right instanceof Token) ? right : tokenize(right);

    while (result.isGood && (lt || rt)) {
        result = tokensCompare(lt, rt);

        lt = result.leftNext;
        rt = result.rightNext;
    }

    return result;
}

function tokensCompare(lt, rt) {
    if (!lt && rt) return tokensCompare(rt, lt).swapTokens();

    switch (lt.type) {
        case TokenType.Char: return charCompare(lt, rt);
        case TokenType.Single: return singleCompare(lt, rt);
        case TokenType.Set: return setCompare(lt, rt);
        case TokenType.AnyString: return anyCompare(lt, rt);
    }
}

function anyCompare(tAny, tOther) {
    if (!tOther) return new CompareResult(tAny.next, null);

    var result = CompareResult.BadResult;

    while (tOther && !result.isGood) {
        while (tOther && !result.isGood) {
            switch (tOther.type) {
                case TokenType.Char: result = charCompare(tOther, tAny.next).swapTokens(); break;
                case TokenType.Single: result = singleCompare(tOther, tAny.next).swapTokens(); break;
                case TokenType.Set: result = setCompare(tOther, tAny.next).swapTokens(); break;
                case TokenType.AnyString:
                    // the anyCompare from the intersects will take over the processing.
                    result = intersects(tAny, tOther.next);
                    if (result.isGood) return result;
                    return intersects(tOther, tAny.next).swapTokens();
            }

            if (!result.isGood) tOther = tOther.next;
        }

        if (result.isGood) {
            // we've found a starting point, but now we want to make sure this will always work.
            result = intersects(result.leftNext, result.rightNext);
            if (!result.isGood) tOther = tOther.next;
        }
    }

    // If we never got a good result that means we've eaten everything.
    if (!result.isGood) result = new CompareResult(tAny.next, null, true);

    return result;
}

function charCompare(tChar, tOther) {
    if (!tOther) return CompareResult.BadResult;

    switch (tOther.type) {
        case TokenType.Char: return charCharCompare(tChar, tOther); 
        case TokenType.Single: return new CompareResult(tChar.next, tOther.next);
        case TokenType.Set: return setCharCompare(tOther, tChar).swapTokens(); 
        case TokenType.AnyString: return anyCompare(tOther, tChar).swapTokens();
    }
}

function singleCompare(tSingle, tOther) {
    if (!tOther) return CompareResult.BadResult;

    switch (tOther.type) {
        case TokenType.Char: return new CompareResult(tSingle.next, tOther.next);
        case TokenType.Single: return new CompareResult(tSingle.next, tOther.next);
        case TokenType.Set: return new CompareResult(tSingle.next, tOther.next);
        case TokenType.AnyString: return anyCompare(tOther, tSingle).swapTokens();
    }
}
function setCompare(tSet, tOther) {
    if (!tOther) return CompareResult.BadResult;

    switch (tOther.type) {
        case TokenType.Char: return setCharCompare(tSet, tOther);
        case TokenType.Single: return new CompareResult(tSet.next, tOther.next);
        case TokenType.Set: return setSetCompare(tSet, tOther);
        case TokenType.AnyString: return anyCompare(tOther, tSet).swapTokens();
    }
}

function anySingleCompare(tAny, tSingle) {
    var nextResult = (tAny.next) ? singleCompare(tSingle, tAny.next).swapTokens() :
        new CompareResult(tAny, tSingle.next);
    return (nextResult.isGood) ? nextResult: new CompareResult(tAny, tSingle.next);
}

function anyCharCompare(tAny, tChar) {
    var nextResult = (tAny.next) ? charCompare(tChar, tAny.next).swapTokens() :
        new CompareResult(tAny, tChar.next);

    return (nextResult.isGood) ? nextResult : new CompareResult(tAny, tChar.next);
}

function charCharCompare(litA, litB) {
    return (litA.val === litB.val) ?
        new CompareResult(litA.next, litB.next) : CompareResult.BadResult;
}

function setCharCompare(tSet, tChar) {
    return (tSet.val.indexOf(tChar.val) > -1) ?
        new CompareResult(tSet.next, tChar.next) : CompareResult.BadResult;
}

function setSetCompare(tSetA, tSetB) {
    var setA = tSetA.val,
        setB = tSetB.val;

    for (var i = 0, il = setA.length; i < il; i++) {
        if (setB.indexOf(setA.charAt(i)) > -1) return new CompareResult(tSetA.next, tSetB.next);
    }
    return CompareResult.BadResult;
}



jsFiddle



时间复杂度



任何带有递归回溯中至少为O(N 2 )。

我故意用单个开关将所有分支分解成各自的功能。通常,当一个字符串就足够时,我使用命名常量。这样做使代码更长,更冗长,但我认为它使遵循起来更容易。

I purposely broke out any branches into there own functions with a singular switch. Assitionally I used named constants when a one character string would suffice. Doing this made the code longer and more verbose, but I think it makes it easier to follow.

您可以在小提琴中查看所有测试。您可以在Fiddle输出中查看注释以了解其目的。每种令牌类型都针对每种令牌类型进行了测试,但我还没有做过一次可以在所有测试中进行所有可能比较的尝试。我还想出了一些像下面这样的随机性强项。

You can view all the tests in the Fiddle. You can view the comments in the Fiddle output to glean their purposes. Each token type was tested against each token type, but I haven't made one that tried all possible comparisons in a single test. I also came up with a few random tough ones like the one below.

abc [def]?fghi?* nop * [tuv] uv [wxy]?yz => a?[cde] defg *?ilmn [opq] * tu * [xyz] *

我在 jsFiddle ,如果有人想自己测试一下。一旦添加递归,日志记录就会中断。

I added an interface on the jsFiddle if anybody wants to test this out themselves. The logging is broken once I added the recursion.

我认为我没有进行足够的负面测试,尤其是在我创建的最后一个版本中。

I don't think I tried enough negative tests, especially with the last version I created.

当前的解决方案是蛮力的,但足以应付任何情况。我想回到这一点,通过一些简单的优化来提高时间复杂度。

Currently the solution is a brute force one, but is sufficient to handle any case. I would like to come back to this at some point to improve the time complexity with some simple optimizations.

开始检查以减少比较可能会增加某些常见问题的处理时间场景。例如,如果一种模式以 star 开头,然后以一种模式结束,则我们已经知道它们将相交。我还可以检查模式开头和结尾的所有字符,如果两个模式都匹配,则将其删除。这样,它们便不再包含在将来的递归中。

Checks at the start to reduce comparisons could increase processing time for certain common scenarios. For example, if one pattern starts with a star and one ends with one then we already know they will intersect. I can also check all the characters from the start and end of the patterns and remove them if the match on both patterns. This way they are excluded from any future recursion.

我最初使用 @ m.buettner的测试来测试我的代码在我提出自己的想法之前。另外,我还浏览了他的代码,以帮助我更好地理解问题。

I used @m.buettner's tests initially to test my code before I came up with my own. Also I walked through his code to help me understand the problem better.

这篇关于找出两个Glob模式(或正则表达式)的匹配项是否相交的算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆