在 N 个字符串中查找公共子串的算法 [英] Algorithm to find common substring across N strings

查看:28
本文介绍了在 N 个字符串中查找公共子串的算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我熟悉 2 个字符串的 LCS 算法.寻找在 2..N 字符串中查找公共子字符串的建议.每对中可能有多个公共子串.字符串的子集中可以有不同的公共子字符串.

I'm familiar with LCS algorithms for 2 strings. Looking for suggestions for finding common substrings in 2..N strings. There may be multiple common substrings in each pair. There can be different common substrings in subsets of the strings.

字符串:(ABCDEFGHIJKL) (DEF) (ABCDEF) (BIJKL) (FGH)

常用字符串:

1/2 (DEF)
1/3 (ABCDEF)
1/4 (IJKL)
1/5 (FGH)
2/3 (DEF)

最长的公共字符串:

1/3 (ABCDEF)

最常见的字符串:

1/2/3 (DEF)

推荐答案

这种事情在 DNA 序列分析中一直存在.您可以为它找到各种算法.此处列出了一个合理的集合.

This sort of thing is done all the time in DNA sequence analysis. You can find a variety of algorithms for it. One reasonable collection is listed here.

还有为每个子串制作表格的蛮力方法(如果你只对短的感兴趣):形成一个 N-ary 树(字母 N=26,字母 256ASCII),并在每个节点存储计数的直方图.如果您删除很少使用的节点(以保持合理的内存需求),您最终会得到一个算法,该算法可以在 N*M^2*log(M) 时间中找到长度不超过 M 的所有子序列以输入长度N. 如果您改为将其拆分为 K 个单独的字符串,您可以构建树结构,只需通过树一次即可读出答案.

There's also the brute-force approach of making tables of every substring (if you're interested only in short ones): form an N-ary tree (N=26 for letters, 256 for ASCII) at each level, and store histograms of the count at every node. If you prune off little-used nodes (to keep the memory requirements reasonable), you end up with an algorithm that finds all subsequences of length up to M in something like N*M^2*log(M) time for input of length N. If you instead split this up into K separate strings, you can build the tree structure and just read off the answer(s) in a single pass through the tree.

这篇关于在 N 个字符串中查找公共子串的算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆