使用动态编程了解正则表达式字符串匹配 [英] Understanding regex string matching using Dynamic Programming

查看:77
本文介绍了使用动态编程了解正则表达式字符串匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我遇到了一个问题,要求您实现支持'。'和'*'的正则表达式匹配器,其中

I came across this problem that asks you to implement a regular expression matcher with support for '.' and '*', where

'。'匹配任何单个

'*'匹配零个或多个前一个元素。

'*' Matches zero or more of the preceding element.

isMatch("aa","a") → false
isMatch("aa","aa") → true
isMatch("aaa","aa") → false
isMatch("aa", "a*") → true
isMatch("aa", ".*") → true
isMatch("ab", ".*") → true
isMatch("aab", "c*a*b") → true

能够以线性方式解决此问题,我遇到了很多使用DP的解决方案,例如下面的解决方案,

While I was able to solve this in a linear fashion, I came across lots of solutions that used DP, like the one below,

class Solution {
    public boolean isMatch(String text, String pattern) {
        boolean[][] dp = new boolean[text.length() + 1][pattern.length() + 1];
        dp[text.length()][pattern.length()] = true;

        for (int i = text.length(); i >= 0; i--){
            for (int j = pattern.length() - 1; j >= 0; j--){
                boolean first_match = (i < text.length() && 
                                       (pattern.charAt(j) == text.charAt(i) ||
                                        pattern.charAt(j) == '.'));
                if (j + 1 < pattern.length() && pattern.charAt(j+1) == '*'){
                    dp[i][j] = dp[i][j+2] || first_match && dp[i+1][j];
                } else {
                    dp[i][j] = first_match && dp[i+1][j+1];
                }
            }
        }
        return dp[0][0];
    }
}

我很难理解这一点。我已经解决了一些涉及网格的DP问题(2d网格中的最短路径,二进制2d网格中的最大正方形),在那里使用了一个DP表格对我来说非常合理。但是,这里我完全迷失了方向,我无法理解遍历2D表如何解决此问题。更进一步地,我们似乎知道循环中的字符何时不匹配,所以我不明白为什么我们不在那里终止搜索(这也可能是由于我对表遍历如何导致的缺乏了解一个解法)。

I'm having a hard time understanding this. I've solved a few DP problems that involved grids (shortest path in 2d grid, largest square in binary 2d grid), using a DP table there made perfect sense to me. However here I'm completely lost, I'm unable to understand how traversing a 2d table helps in solving this problem. Further more it appears we know when the characters don't match in the loop, so I don't understand why we don't terminate the search there (this is also probably due to my lack of understanding of how a table traversal leads to a solution). Is there a clear intuitive explanation for problems like these?

推荐答案

使用DP解决这样的问题的直觉是找出答案对于以下问题

The intuition to solve problem like this using DP is to figure out answer to following questions


  1. 可以使用递归解决问题吗?

  2. 递归树中是否会重复出现较小的子问题?如果是,则可以通过以下方式存储较小问题的结果:只要遇到类似的子问题,就可以在O(1)中访问结果。通常将其称为记忆。

首先让我们了解问题的解决方案,当您以线性方式求解时会想出这一点。

Lets first understand the problem's solution which you would have figured out while solving in linear fashion.


  1. 同时将带有模式的文本与第一个字符匹配或不匹配。

  1. While matching the text with pattern either first character will match or it will not match.

情况1:第一个字符匹配或模式的第一个字符是'。

Case 1: First character matches or first character of pattern is '.'

情况1.1下一个字符是'*'

Case 1.1 Next character is '*'

案例1.2下一个字符不是'*'

Case 1.2 Next character is not '*'

案例2:第一个字符不匹配

Case 2: First character does not match

case 2.1下一个字符是'*'

case 2.1 Next character is '*'

case 2.2下一个字符不是'*'

case 2.2 Next character is not '*'

现在让我们找出前面针对上述问题讨论的两个步骤。

Lets now figure out two steps discussed earlier for above problem.

对于案例1.1,示例

isMatch( a, a * a) isMatch( aab, a * b)
等效于解决

isMatch("a", "a*a") and isMatch("aab", "a*b") which is equivalent to solving

isMatch( a, a )|| isMatch(, a * a)

isMatch( aab, b) || isMatch( ab, a * b) || 条件的第一部分涉及模式中可选字符的情况,后跟'*',第二部分涉及重复匹配字符的情况。

isMatch("aab", "b") || isMatch("ab", "a*b") respectively. First part of || condition covers the scenario of optional character in pattern as it is followed by '*' and second part covers the scenario for repetition of matched character.

在所有情况下都可以形成类似的子问题。情况2.2
立即返回false。

Similarly sub-problems can be formed for all the cases. case 2.2 should straightaway return false.

下面是具有递归方法的Java代码

Below is the java code with recursive approach

public boolean isMatch(String text, String pattern) {
    dp = new Boolean[text.length()][pattern.length()];
    return isMatch(text, pattern, 0, 0);
}

private boolean isMatch(String text, String pattern, int ti, int pi) {

    if (pi == pattern.length() && ti < text.length()) return false;

    if (ti == text.length() && pi == pattern.length()) return true;

    if (ti == text.length()) {
        return isNextCharStar(pattern, pi) && isMatch(text, pattern, ti, pi + 2);
    }


    boolean result;
    final boolean hasFirstMatched = text.charAt(ti) == pattern.charAt(pi) || pattern.charAt(pi) == '.';

    if (isNextCharStar(pattern, pi)) {
        result = isMatch(text, pattern, ti, pi + 2);
        if (hasFirstMatched) {
            result = result || isMatch(text, pattern, ti + 1, pi);
        }
        return result;
    }

    return hasFirstMatched && isMatch(text, pattern, ti + 1, pi + 1);

}

private boolean isNextCharStar(String pattern, int pi) {
    return pi < pattern.length() - 1 && pattern.charAt(pi + 1) == '*';
}

现在让我们应用记忆步骤。如果在返回之前开始保存结果,则下次可以重用。我们应该如何保存它?
对于 ti pi 的所有可能组合,我们必须存储结果。其中 ti 是文本索引,而 pi 是样式索引。因此,尺寸为 text.length * pattern.length 的二维数组就足够了。以下是经过上述更改后的代码

Now lets apply the memoization step. If we start saving the result before returning we can reuse it next time. How and where should we save it? For all possible combinations of ti and pi we have to store the result. Where ti is text Index and pi is pattern index. So a 2d array of size text.length * pattern.length should be sufficient. Following is the code after above changes

Boolean [][] dp;

public boolean isMatch(String text, String pattern) {
    dp = new Boolean[text.length()][pattern.length()];
    return isMatch(text, pattern, 0, 0);
}

private boolean isMatch(String text, String pattern, int ti, int pi) {

    if (pi == pattern.length() ) return ti == text.length();

    if (ti == text.length()) {
        return isNextCharStar(pattern, pi) && isMatch(text, pattern, ti, pi + 2);
    }

    if(dp[ti][pi] != null) return dp[ti][pi];

    boolean result;
    final boolean hasFirstMatched = text.charAt(ti) == pattern.charAt(pi) || pattern.charAt(pi) == '.';

    if (isNextCharStar(pattern, pi)) {
        result = isMatch(text, pattern, ti, pi + 2);
        if (hasFirstMatched) {
            result = result || isMatch(text, pattern, ti + 1, pi);
        }
        dp[ti][pi] = result;
        return result;
    }

    dp[ti][pi] = hasFirstMatched && isMatch(text, pattern, ti + 1, pi + 1);
    return dp[ti][pi];

}

private boolean isNextCharStar(String pattern, int pi) {
    return pi < pattern.length() - 1 && pattern.charAt(pi + 1) == '*';
}

如果您仔细观察,仅更改了3行以使其成为DP递归解决方案。

If you will look closely only 3 lines have been changed to make it a DP solution from a recursive solution.

这篇关于使用动态编程了解正则表达式字符串匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆