How to efficiently find identical substrings of a specified length in a collection of strings?

Problem description

I have a collection S, typically containing 10-50 long strings. For illustrative purposes, suppose the length of each string ranges between 1000 and 10000 characters.

I would like to find strings of specified length k (typically in the range of 5 to 20) that are substrings of every string in S. This can obviously be done using a naive approach - enumerating every k-length substring in S[0] and checking if they exist in every other element of S.
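For reference, the naive approach described above could be sketched roughly as follows. The function name `common_substrings_naive` and its parameters are illustrative assumptions, not part of the original question.

```python
def common_substrings_naive(strings, k):
    """Return the length-k substrings of strings[0] that occur in every other string.

    Hypothetical helper illustrating the naive baseline: enumerate every
    k-length window of the first string, then test each candidate against
    all remaining strings with Python's substring operator.
    """
    if not strings:
        return set()
    first = strings[0]
    # A set removes duplicate windows so each candidate is checked only once.
    candidates = {first[i:i + k] for i in range(len(first) - k + 1)}
    return {c for c in candidates if all(c in s for s in strings[1:])}


print(common_substrings_naive(["abcdef", "xbcdey", "bcde"], 3))
```

Each candidate is searched for in every other string, so the cost grows with the number of distinct windows in the first string multiplied by the total length of the remaining strings.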

Are there more efficient ways of approaching the problem? As far as I can tell, there are some similarities between this and the longest common subsequence problem, but my understanding of LCS is limited and I'm not sure how it could be adapted to the situation where we bound the desired common substring length to k, or if subsequence techniques can be applied to finding substrings.

Recommended answer

Here's one fairly simple algorithm, which should be reasonably fast.

  1. Using a rolling hash, as in the Rabin-Karp string search algorithm, construct a hash table H0 of all |S0|-k+1 length-k substrings of S0. That's roughly O(|S0|), since each hash is computed in O(1) from the previous hash, but it will take longer if there are collisions or duplicate substrings, because each hash match has to be confirmed by comparing up to k characters. Using a better hash will help with collisions, but if S0 contains many duplicate k-length substrings, you could end up spending O(k|S0|) time.

  2. Now use the same rolling hash on S1. This time, look each substring up in H0 and, if you find it, remove it from H0 and insert it into a new table H1. Again, this should be around O(|S1|) unless you have some pathological case, such as S0 and S1 both being long repetitions of the same character. (It's also going to be suboptimal if S0 and S1 are the same string, or have lots of overlapping pieces.)

  3. Repeat step 2 for each Si, each time creating a new hash table. (At the end of each iteration of step 2, you can delete the hash table from the previous step.)

  4. At the end, the last hash table will contain all the common k-length substrings.
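The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the answerer's own code: the names `rolling_hashes` and `common_substrings`, the constants `BASE` and `MOD`, and the choice to store the substrings themselves in each bucket (to confirm matches and rule out hash collisions) are all assumptions made for the example.

```python
BASE = 257            # assumed polynomial base for the rolling hash
MOD = (1 << 61) - 1   # assumed large prime modulus


def rolling_hashes(s, k):
    """Yield (start, hash) for every length-k window of s, Rabin-Karp style."""
    if k <= 0 or len(s) < k:
        return
    h = 0
    for ch in s[:k]:
        h = (h * BASE + ord(ch)) % MOD
    yield 0, h
    top = pow(BASE, k - 1, MOD)  # weight of the character leaving the window
    for i in range(1, len(s) - k + 1):
        h = (h - ord(s[i - 1]) * top) % MOD            # drop the leftmost character
        h = (h * BASE + ord(s[i + k - 1])) % MOD       # append the new rightmost one
        yield i, h


def common_substrings(strings, k):
    """Return the set of length-k substrings shared by every string."""
    if not strings:
        return set()
    # Step 1: H0 maps each hash to the distinct substrings of S0 with that hash.
    table = {}
    for i, h in rolling_hashes(strings[0], k):
        table.setdefault(h, set()).add(strings[0][i:i + k])
    # Steps 2 and 3: move every substring found in Si from the old table to a new one.
    for s in strings[1:]:
        new_table = {}
        for i, h in rolling_hashes(s, k):
            bucket = table.get(h)
            if not bucket:
                continue  # hash never seen (or already moved): skipped in O(1)
            sub = s[i:i + k]
            if sub in bucket:  # confirm the match to guard against collisions
                bucket.remove(sub)
                new_table.setdefault(h, set()).add(sub)
        table = new_table  # the previous table can now be discarded
    # Step 4: whatever survived every pass is common to all strings.
    return {sub for bucket in table.values() for sub in bucket}


print(common_substrings(["abcdef", "xbcdey", "bcde"], 3))
```

In this sketch the rolling hash mainly pays off in steps 2 and 3, where windows whose hash is absent from the previous table are rejected without building the substring. Step 1 slices every window of S0, which costs O(k) per window; a tighter variant might store start offsets into S0 instead.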

The total run time should be about O(Σ|Si|) but in the worst case it could be O(kΣ|Si|). Even so, with the problem size as described, it should run in acceptable time.
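As a rough sanity check on that claim, plugging in the upper ends of the ranges given in the question (50 strings, 10000 characters each, k = 20) gives the following back-of-the-envelope bounds. These count elementary character or hash operations only and say nothing about constant factors.

```latex
\sum_i |S_i| \le 50 \times 10^{4} = 5 \times 10^{5},
\qquad
k \sum_i |S_i| \le 20 \times 5 \times 10^{5} = 10^{7}
```

Even the worst-case figure is on the order of ten million basic operations.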
