查找给定两个字符串的所有常见子字符串 [英] Finding all the common substrings of given two strings

查看:104
本文介绍了查找给定两个字符串的所有常见子字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我遇到了一个问题陈述,找到给定的两个子字符串之间的所有公共子字符串,这样在每种情况下都必须打印最长的子字符串。问题陈述如下:

I have come across a problem statement to find the all the common sub-strings between the given two sub-strings such a way that in every case you have to print the longest sub-string. The problem statement is as follows:


编写程序以查找两个给定字符串之间的公共子字符串。但是,不要包含较长公共子字符串中包含的子字符串。

Write a program to find the common substrings between the two given strings. However, do not include substrings that are contained within longer common substrings.

例如,给定输入字符串 eatsleepnightxyz eatsleepabcxyz ,结果应为:

For example, given the input strings eatsleepnightxyz and eatsleepabcxyz, the results should be:


  • eatsleep (由于 eatsleep nightxyz eatsleep abcxyz

  • xyz (由于 eatsleepnight xyz eatsleepabc xyz

  • a (由于 e a tsleepnightxyz eatsleep a bcxyz

  • t (由于 eatsleepnigh t xyz ea t sleepabcxyz

  • eatsleep (due to eatsleepnightxyz eatsleepabcxyz)
  • xyz (due to eatsleepnightxyz eatsleepabcxyz)
  • a (due to eatsleepnightxyz eatsleepabcxyz)
  • t (due to eatsleepnightxyz eatsleepabcxyz)

但是,结果集包括 e 来自
e atsleepnightxyz eatsl e epabcxyz ,因为 e 已经是包含在上面提到的 eatsleep 中。你也不应该包括 ea ats 等等。,因为这些也都被 eatsleep 所涵盖。

However, the result set should not include e from eatsleepnightxyz eatsleepabcxyz, because both es are already contained in the eatsleep mentioned above. Nor should you include ea, eat, ats, etc., as those are also all covered by eatsleep.

在此,你不必使用String实用程序方法如:contains,indexOf,StringTokenizer,split和replace。

In this, you don't have to make use of String utility methods like: contains, indexOf, StringTokenizer, split and replace.

我的算法如下:我从粗暴开始当我提高基本理解时,强制并将切换到更优化的解决方案。

My Algorithm is as follows: I am starting with brute force and will switch to more optimized solution when I improve my basic understanding.

 For String S1:
     Find all the substrings of S1 of all the lengths
     While doing so: Check if it is also a substring of 
     S2.

尝试弄清楚我的方法的时间复杂性。

Attempt to figure out the time complexity of my approach.

让两个给定的字符串为n1-String和n2-String


  1. S1的子串显然是n1(n1 + 1)/ 2。

  2. 但是我们必须找到S1的子串的平均长度。

  3. 让我们说它是m。我们将分别找到m。

  4. 时间复杂度检查m-String是否是
    n-String的子字符串是O(n * m)。

  5. 现在,我们正在检查每个m-String是S2的子串,
    是一个n2-String。

  6. 正如我们上面所看到的,这是一个O(n 2 m)算法。

  7. 所需时间通过整体算法然后是

  8. Tn =(S1中的子串数)*(字符比较过程的平均子串长度时间)

  9. 通过执行确定我得出结论,
    时间复杂度为O(n 3 m 2

  10. 现在,我们的工作是找到m的n1。

  1. The number of substrings of S1 is clearly n1(n1+1)/2.
  2. But we have got to find the average length a substring of S1.
  3. Let’s say it is m. We’ll find m separately.
  4. Time Complexity to check whether an m-String is a substring of an n-String is O(n*m).
  5. Now, we are checking for each m-String is a substring of S2, which is an n2-String.
  6. This, as we have seen above, is an O(n2 m) algorithm.
  7. The time required by the overall algorithm then is
  8. Tn=(Number of substrings in S1) * (average substring lengthtime for character comparison procedure)
  9. By performing certain calculations, I came to conclusion that the time complexity is O(n3 m2)
  10. Now, our job is to find m in terms of n1.

尝试用n1来找到m。

T n =(n)(1)+(n-1)(2)+(n-2)(3)+。 .... +(2)(n-1)+(1)(n)

其中T n 是所有子串的长度之和。

Tn = (n)(1) + (n-1)(2) + (n-2)(3) + ..... + (2)(n-1) + (1)(n)
where Tn is the sum of lengths of all the substrings.

平均值是该总和除以生成的子串总数。

Average will be the division of this sum by the total number of Substrings produced.

这只是一个总和,划分问题,其解决方案如下:O(n)

This, simply is a summation and division problem whose solution is as follows O(n)

因此......

算法的运行时间为O(n ^ 5)。

Running time of my algorithm is O(n^5).

考虑到这一点,我编写了以下代码:

With this in mind I wrote the following code:

 package pack.common.substrings;

 import java.util.ArrayList;
 import java.util.LinkedHashSet;
 import java.util.List;
 import java.util.Set;

 public class FindCommon2 {
    public static final Set<String> commonSubstrings = new      LinkedHashSet<String>();

 public static void main(String[] args) {
    printCommonSubstrings("neerajisgreat", "neerajisnotgreat");
    System.out.println(commonSubstrings);
}

 public static void printCommonSubstrings(String s1, String s2) {
    for (int i = 0; i < s1.length();) {
        List<String> list = new ArrayList<String>();
        for (int j = i; j < s1.length(); j++) {
            String subStr = s1.substring(i, j + 1);
            if (isSubstring(subStr, s2)) {
                list.add(subStr);
            }
        }
        if (!list.isEmpty()) {
            String s = list.get(list.size() - 1);
            commonSubstrings.add(s);
            i += s.length();
        }
    }
 }

 public static boolean isSubstring(String s1, String s2) {
    boolean isSubstring = true;
    int strLen = s2.length();
    int strToCheckLen = s1.length();
    if (strToCheckLen > strLen) {
        isSubstring = false;
    } else {
        for (int i = 0; i <= (strLen - strToCheckLen); i++) {
            int index = i;
            int startingIndex = i;
            for (int j = 0; j < strToCheckLen; j++) {
                if (!(s1.charAt(j) == s2.charAt(index))) {
                    break;
                } else {
                    index++;
                }
            }
            if ((index - startingIndex) < strToCheckLen) {
                isSubstring = false;
            } else {
                isSubstring = true;
                break;
            }
        }
    }
    return isSubstring;
 }
}

我的代码说明:

 printCommonSubstrings: Finds all the substrings of S1 and 
                        checks if it is also a substring of 
                        S2.
 isSubstring : As the name suggests, it checks if the given string 
               is a substring of the other string.

问题:给定输入

  S1 = "neerajisgreat";
  S2 = "neerajisnotgreat"
  S3 = "rajeatneerajisnotgreat"

如果是S1和S2,输出应该是: neerajis 很好
但是在S1和S3的情况下,输出应该是:
neerajis raj 很棒但我仍然得到 neerajis 很棒作为输出。我需要弄清楚这一点。

In case of S1 and S2, the output should be: neerajis and great but in case of S1 and S3, the output should have been: neerajis, raj, great, eat but still I am getting neerajis and great as output. I need to figure this out.

我应该如何设计我的代码?

How should I design my code?

推荐答案

使用适当的算法算法而不是蛮力方法会更好。维基百科描述了最常见的子字符串问题的两种常见解决方案:后缀树动态编程

You would be better off with a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix-tree and dynamic-programming.

动态编程解决方案需要O( nm )时间和O( nm )空间。对于最长的公共子字符串,这几乎是对Wikipedia伪代码的直接Java翻译:

The dynamic programming solution takes O(n m) time and O(n m) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring:

public static Set<String> longestCommonSubstrings(String s, String t) {
    int[][] table = new int[s.length()][t.length()];
    int longest = 0;
    Set<String> result = new HashSet<>();

    for (int i = 0; i < s.length(); i++) {
        for (int j = 0; j < t.length(); j++) {
            if (s.charAt(i) != t.charAt(j)) {
                continue;
            }

            table[i][j] = (i == 0 || j == 0) ? 1
                                             : 1 + table[i - 1][j - 1];
            if (table[i][j] > longest) {
                longest = table[i][j];
                result.clear();
            }
            if (table[i][j] == longest) {
                result.add(s.substring(i - longest + 1, i + 1));
            }
        }
    }
    return result;
}

现在,您需要所有常见的子串,而不仅仅是最长的子串。您可以增强此算法以包含更短的结果。让我们检查表格中的示例输入 eatsleepnightxyz eatsleepabcxyz

Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyz and eatsleepabcxyz:

  e a t s l e e p a b c x y z
e 1 0 0 0 0 1 1 0 0 0 0 0 0 0
a 0 2 0 0 0 0 0 0 1 0 0 0 0 0
t 0 0 3 0 0 0 0 0 0 0 0 0 0 0
s 0 0 0 4 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 5 0 0 0 0 0 0 0 0 0
e 1 0 0 0 0 6 1 0 0 0 0 0 0 0
e 1 0 0 0 0 1 7 0 0 0 0 0 0 0
p 0 0 0 0 0 0 0 8 0 0 0 0 0 0
n 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i 0 0 0 0 0 0 0 0 0 0 0 0 0 0
g 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t 0 0 1 0 0 0 0 0 0 0 0 0 0 0
x 0 0 0 0 0 0 0 0 0 0 0 1 0 0
y 0 0 0 0 0 0 0 0 0 0 0 0 2 0
z 0 0 0 0 0 0 0 0 0 0 0 0 0 3




  • eatsleep 结果很明显:这是左上角的 12345678 对角条纹。

  • xyz 结果是右下角的 123 对角线。

  • a 结果由顶部附近的 1 表示(第二行,第九列)。

  • t 结果由左下角附近的 1 表示。

    • The eatsleep result is obvious: that's the 12345678 diagonal streak at the top-left.
    • The xyz result is the 123 diagonal at the bottom-right.
    • The a result is indicated by the 1 near the top (second row, ninth column).
    • The t result is indicated by the 1 near the bottom left.
    • 左边,顶部和<$ c旁边的其他 1 怎么样? $ c> 6 和 7 ?那些不计算,因为它们出现在由 12345678 对角线形成的矩形内 - 换句话说,它们已被 eatsleep

      What about the other 1s at the left, the top, and next to the 6 and 7? Those don't count because they appear within the rectangle formed by the 12345678 diagonal — in other words, they are already covered by eatsleep.

      我建议除了构建表之外什么都不做。然后,进行第二次传递,从右下角向后迭代,以收集结果集。

      I recommend doing one pass doing nothing but building the table. Then, make a second pass, iterating backwards from the bottom-right, to gather the result set.

      这篇关于查找给定两个字符串的所有常见子字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆