高效的算法字符串连接与重叠 [英] Efficient Algorithm for String Concatenation with Overlap

查看:166
本文介绍了高效的算法字符串连接与重叠的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们需要通过串联结合3列在数据库中。然而,3列可能包含重叠部分和部件不应重复。例如,

 一+B+C=> ABC
  ABCDE+defgh+ghlmn=> abcdefghlmn
  abcdede+dedefgh+=> abcdedefgh
  ABCDE+D+ghlmn=> abcdedghlmn
  ABCDEF++defghl=> abcdefghl
 

我们目前的算法是pretty的慢,因为它使用蛮力,以确定两个字符串之间的重叠部分。有没有人知道一个高效的算法来做到这一点?

假设我们有两个字符串A和B的算法需要找到最长的公共子串S以便A结束与S和B开始以S

我们的目前的强力实施的Java连接仅供参考,

 公共静态使用concat(字符串S1,S2字符串){
如果(S1 == NULL)
返回S2;
如果(S2 == NULL)
返回S1;
INT的len = Math.min(s1.length(),s2.length());

//找到索引重叠部分的端部
INT指数= -1;
的for(int i = LEN; I> 0;我 - ){
字符串的子串= s2.substring(0,I);
如果(s1.endsWith(子串)){
指数=我;
			打破;
}
}
StringBuilder的SB =新的StringBuilder(S1);
如果(指数< 0)
sb.append(S2);
否则,如果(指数< = s2.length())
sb.append(s2.substring(指数));
返回sb.toString();
}
 

解决方案

其他大多数答案都集中在恒定因素的优化,但它也有可能做的渐近更好。看看你的算法:它是O(N ^ 2)。这似乎是可以解决的比这更快的问题!

考虑克努特莫里斯普拉特。它跟踪的子字符串,我们已经匹配,到目前为止整个的最高金额。这意味着它知道多少S1已经匹配的在S2 的结束时,这就是我们正在寻找的价值!只需修改算法继续,而不是当它早在串匹配返回,并将它返回匹配的,而不是0的量在端

这是给你一个O(n)的算法。不错!

  INT OverlappedStringLength(字符串S1,S2线){
        //修剪S1,所以它不是比s2更长
        如果(s1.Length> s2.Length)S1 = s1.Substring(s1.Length  -  s2.Length);

        INT [] T = ComputeBackTrackTable(S2)​​; //上)

        INT米= 0;
        INT I = 0;
        而(M + 1&所述; s1.Length){
            如果(S2 [我] == S1 [M + 1]){
                我+ = 1;
                //<  - 删除返回这里的情况,因为| S | < = | S2 |
            } 其他 {
                M + = I  -  T [I];
                如果(ⅰ大于0)I = T [i]于;
            }
        }

        返回我; //<  - 改回到这里返回匹配的字符
    }

    INT [] ComputeBackTrackTable(字符串s){
        变种T =新INT [s.Length]
        INT CND = 0;
        T [0] = 1;
        T [1] = 0;
        INT POS = 2;
        而(POS&L​​T; s.Length){
            如果(S [POS  -  1] == S [来电显示]){
                T [POS] = CND + 1;
                POS + = 1;
                CND + = 1;
            }否则如果(CND> 0){
                CND = T [来电显示]
            } 其他 {
                T [POS] = 0;
                POS + = 1;
            }
        }

        返回笔;
    }
 

OverlappedStringLength(ABCDEF,defghl)返回3

We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts and the parts should not be duplicated. For example,

  "a" + "b" + "c" => "abc"
  "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
  "abcdede" + "dedefgh" + "" => "abcdedefgh"
  "abcde" + "d" + "ghlmn" => "abcdedghlmn"
  "abcdef" + "" + "defghl" => "abcdefghl"

Our current algorithm is pretty slow because it uses brute-force to identify the overlapping part between 2 strings. Does any one know an efficient algorithm to do this?

Say we have 2 strings A and B. The algorithm needs to find the longest common substring S so that A ends with S and B starts with S.

Our current brute-force implementation in Java is attached for reference,

public static String concat(String s1, String s2) {
	if (s1 == null)
		return s2;
	if (s2 == null)
		return s1;
	int len = Math.min(s1.length(), s2.length());

	// Find the index for the end of overlapping part
	int index = -1;
	for (int i = len; i > 0; i--) {
		String substring = s2.substring(0, i);
		if (s1.endsWith(substring)) {
			index = i;
			break;
		}
	}
	StringBuilder sb = new StringBuilder(s1);
	if (index < 0) 
		sb.append(s2);
	else if (index <= s2.length())
		sb.append(s2.substring(index));
	return sb.toString();
}

解决方案

Most of the other answers have focused on constant-factor optimizations, but it's also possible to do asymptotically better. Look at your algorithm: it's O(N^2). This seems like a problem that can be solved much faster than that!

Consider Knuth Morris Pratt. It keeps track of the maximum amount of substring we have matched so far throughout. That means it knows how much of S1 has been matched at the end of S2, and that's the value we're looking for! Just modify the algorithm to continue instead of returning when it matches the substring early on, and have it return the amount matched instead of 0 at the end.

That gives you an O(n) algorithm. Nice!

    int OverlappedStringLength(string s1, string s2) {
        //Trim s1 so it isn't longer than s2
        if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);

        int[] T = ComputeBackTrackTable(s2); //O(n)

        int m = 0;
        int i = 0;
        while (m + i < s1.Length) {
            if (s2[i] == s1[m + i]) {
                i += 1;
                //<-- removed the return case here, because |s1| <= |s2|
            } else {
                m += i - T[i];
                if (i > 0) i = T[i];
            }
        }

        return i; //<-- changed the return here to return characters matched
    }

    int[] ComputeBackTrackTable(string s) {
        var T = new int[s.Length];
        int cnd = 0;
        T[0] = -1;
        T[1] = 0;
        int pos = 2;
        while (pos < s.Length) {
            if (s[pos - 1] == s[cnd]) {
                T[pos] = cnd + 1;
                pos += 1;
                cnd += 1;
            } else if (cnd > 0) {
                cnd = T[cnd];
            } else {
                T[pos] = 0;
                pos += 1;
            }
        }

        return T;
    }

OverlappedStringLength("abcdef", "defghl") returns 3

这篇关于高效的算法字符串连接与重叠的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆