发现在字符串出现一个子序列的数目 [英] Find the number of occurrences of a subsequence in a string

查看:117
本文介绍了发现在字符串出现一个子序列的数目的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

例如,让字符串是第10位圆周率, 3141592653 ,和子是 123 。注意序列发生两次:

For example, let the string be the first 10 digits of pi, 3141592653, and the subsequence be 123. Note that the sequence occurs twice:

3141592653
 1    2  3
   1  2  3

这是一个面试问题,我不能回答,我想不出一个高效的算法,它的窃听我。我觉得这应该可以做一个简单的正则表达式,但那些象 1 * 2 * 3 不返回每个子序列。在Python我幼稚的(算上3的每个经过2各1个)已经运行了一个小时,也没有这样做。

This was an interview question that I couldn't answer and I can't think of an efficient algorithm and it's bugging me. I feel like it should be possible to do with a simple regex, but ones like 1.*2.*3 don't return every subsequence. My naive implementation in Python (count the 3's for each 2 after each 1) has been running for an hour and it's not done.

推荐答案

这是一个经典的动态规划问题(通常不解决使用正则EX pressions)。

This is a classical dynamic programming problem (and not typically solved using regular expressions).

我的幼稚的做法(算上3的每个经过2各1个)已经运行了一个小时,也没有这样做。

My naive implementation (count the 3's for each 2 after each 1) has been running for an hour and it's not done.

这将是它运行在指数时间穷举搜索的方法。 (我很惊讶,它虽然运行几个小时)。

That would be an exhaustive search approach which runs in exponential time. (I'm surprised it runs for hours though).

下面是一个建议一个动态规划的解决方案:

Here's a suggestion for a dynamic programming solution:

(道歉了详细描述,但每一步都是非常简单所以忍耐; - )

(Apologies for the long description, but each step is really simple so bear with me ;-)

  • 如果在是空的找到匹配的(没有剩余的数字匹配!),我们返回1

  • If the subsequence is empty a match is found (no digits left to match!) and we return 1

如果在输入序列是空的,我们已经耗尽了我们的数字,也不可能找到匹配因此,我们回到0

If the input sequence is empty we've depleted our digits and can't possibly find a match thus we return 0

(无论是顺序还是子是空的。)

(Neither the sequence nor the subsequence are empty.)

(假设的 ABCDEF 表示输入序列,和 XYZ 表示该序列。)

(Assume that "abcdef" denotes the input sequence, and "xyz" denotes the subsequence.)

设置结果 0

添加到结果比赛的人数为 BCDEF XYZ (即丢弃的第一个输入数字和递归)

Add to the result the number of matches for bcdef and xyz (i.e., discard the first input digit and recurse)

如果前两位匹配,也就是说,的 = X

If the first two digits match, i.e., a = x

  • 添加到结果比赛对于 BCDEF YZ (即匹配第一个序列数字和递归在剩余的序列数字)

返回结果

这里的递归调用输入的说明1221 / 12 。 (子序列粗体,·重新presents空字符串)

Here's an illustration of the recursive calls for input 1221 / 12. (Subsequence in bold font, · represents empty string.)

如果实施天真,一些(子)问题都解决了多次。动态编程通过记住(通常在一个查找表),从previously解决的子问题的结果避免了这样冗余计算

If implemented naively, some (sub-)problems are solved multiple times (· / 2 for instance in the illustration above). Dynamic programming avoids such redundant computations by remembering the results from previously solved subproblems (usually in a lookup table).

在这种特殊情况下,我们成立了一个表,

In this particular case we set up a table with

  • [序列的长度+ 1]行,
  • [序列的长度+ 1]列:

          

          

我们的想法是,我们应填写在相应的行/列匹配的221 / 2 的数量。一旦这样做,我们应该有最终的解决方案,在小区1221 / 12

The idea is that we should fill in the number of matches for 221 / 2 in the corresponding row / column. Once done, we should have the final solution in cell 1221 / 12.

我们开始填充表与我们立刻知道(基本情况):

We start populating the table with what we know immediately (the "base cases"):

  • 在当前没有子位被留下,我们有1完全吻合:

          

          

  • 在没有序列数字是离开了,我们不能有任何的比赛:

  • When no sequence digits are left, we can't have any matches:

我们然后继续通过填充表自上而下/左到右,根据以下规则:

We then proceed by populating the table top-down / left-to-right according to the following rule:

  • 在单元格[的] [ COL 的]写在发现价值[的-1] [COL]。

  • In cell [row][col] write the value found at [row-1][col].

直观地看,这意味着的火柴数221 / 2 包括了所有的比赛为21 / 2

Intuitively this means "The number of matches for 221 / 2 includes all the matches for 21 / 2."

如果序列行的的和subseq在列的 COL 的开始与相同的数字,加上在[中值的行的-1] [ COL 的-1]刚才写入[值的的] [ COL 的。

If sequence at row row and subseq at column col start with the same digit, add the value found at [row-1][col-1] to the value just written to [row][col].

直观地看这个手段的火柴数量1221 / 12 还包括所有的比赛为221 / 12

Intuitively this means "The number of matches for 1221 / 12 also includes all the matches for 221 / 12."

          

          

最终的结果如下所示:

          

          

和在底部右单元的值的确2

and the value at the bottom right cell is indeed 2.

不是在Python,(我的道歉)。

Not in Python, (my apologies).

class SubseqCounter {

    String seq, subseq;
    int[][] tbl;

    public SubseqCounter(String seq, String subseq) {
        this.seq = seq;
        this.subseq = subseq;
    }

    public int countMatches() {
        tbl = new int[seq.length() + 1][subseq.length() + 1];

        for (int row = 0; row < tbl.length; row++)
            for (int col = 0; col < tbl[row].length; col++)
                tbl[row][col] = countMatchesFor(row, col);

        return tbl[seq.length()][subseq.length()];
    }

    private int countMatchesFor(int seqDigitsLeft, int subseqDigitsLeft) {
        if (subseqDigitsLeft == 0)
            return 1;

        if (seqDigitsLeft == 0)
            return 0;

        char currSeqDigit = seq.charAt(seq.length()-seqDigitsLeft);
        char currSubseqDigit = subseq.charAt(subseq.length()-subseqDigitsLeft);

        int result = 0;

        if (currSeqDigit == currSubseqDigit)
            result += tbl[seqDigitsLeft - 1][subseqDigitsLeft - 1];

        result += tbl[seqDigitsLeft - 1][subseqDigitsLeft];

        return result;
    }
}


复杂

一个奖金为这个填补台底的做法是,它是微不足道弄清楚复杂性。工作的一个恒定的量为每一个细胞完成的,而且我们有长度序列的行数和长度,子列。复杂性是为此的 O(MN)的,其中的 M N 的表示序列的长度。


Complexity

A bonus for this "fill-in-the-table" approach is that it is trivial to figure out complexity. A constant amount of work is done for each cell, and we have length-of-sequence rows and length-of-subsequence columns. Complexity is therefor O(MN) where M and N denote the lengths of the sequences.

这篇关于发现在字符串出现一个子序列的数目的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆