在字符串中的两个序列最长共同子 [英] Longest common substring in two sequences of strings

查看：222 发布时间：2015/11/30 20:17:23 string algorithm data-structures

本文介绍了在字符串中的两个序列最长共同子的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

说完刚刚学会的最长公共子串的算法，我很好奇这个问题的特定变体。它被描述如下： - ：

Having just learned the longest common substring algorithm, I was curious about a particular variant of the problem. It is described as follows -:

给定两个非空字符串的序列中，X =（X1，X2，X3，...，X（N））和Y =（Y1，Y2，Y3，...，γ（米）），其中，x（i）和Y（I）是字符的字符串，找到最长串x，它是所有在Y串子串。

Given two non-empty sequences of strings, X = (x1, x2, x3,....,x(n)) and Y = (y1, y2, y3,..., y(m)), where x(i) and y(i) are strings of characters, find the longest string in X which is a substring of all the strings of Y.

我有一个函数子（X，Y）返回布尔描绘x是否是y或不是一个字符串。显然，我必须串联Y中所有的字符串，形成一个大的字符串，比如，记为B.我认为以下途径 - ：

I have a function substring(x, y) which returns booleans depicting whether x is a substring in y or not. Evidently, I have to concatenate all the strings in Y to form one big string, say, denoted by B. I thought of the following approaches -:

朴素：首先连接所有串x中，形成一个字符串A（N）。应用子（A（N），B） - 这包括迭代落后的字符串A（N）。如果为真，该算法在这里结束并返回A（N） - 或它的任何部分包括在所述子。如果没有，请继续申请（A（N - 1），B）等。如果这样的字符串无X存在，我返回空字符串。

Naive: Start by concatenating all strings in X to form a string A(n). Apply substring(A(n), B) - this includes iterating backward in string A(n). If true, the algorithm ends here and returns A(n) - or whatever portion of it is included in said substring. If not, proceed to apply (A(n - 1), B) and so on. If such a string does not exist in X, I return the empty string.

显然，这种方法会占用相当长的运行时间取决于执行。假设我用迭代的方法，在每次迭代我将不得不向后遍历字符串在该水平/指数，并随后申请子（）。这将需要ATLEAST两个环，和 O（尺寸（B）*最大长度（X1，X2，...））最坏情况下的时间，或更大，取决于子（）（纠正我，如果错了）。

Obviously this approach would take up quite some running time depending on the implementation. Assuming I use an iterative approach, on each iteration I would have to iterate backward through the String at that level/index, and subsequently apply substring(). It would take atleast two loops, and O(size(B) * maxlength(x1, x2,...)) worst case time, or more depending on substring() (correct me if wrong).

我基于后缀树/阵列想到了第二种方法。

I thought of a second approach based on suffix trees/arrays.

广义后缀树：我建立序列Y的使用Ukkonen的算法，商品及服务税的 O（最大长度（Y1，Y2，...） （？）。我缺乏后缀树知识的叮咬。我相信一个后缀树的方法将大大减少运行时间（以成本的空间）查找子串，但我不知道如何执行该操作。

Generalized Suffix Tree: I build a GST of sequence Y using Ukkonen's algorithm in O(maxlength(y1, y2,...)(?). My lack of knowledge of suffix trees bites. I believe a suffix tree approach would substantially reduce running time (at the cost of space) for finding the substring, but I have no idea how to implement the operation.

如果有更好的办法，我很想知道。

If there is a better approach, I'd love to know.

编辑：如果这似乎是我抛弃了专题道歉

Apologies if it seemed like I abandoned the topic.

如果我要使用不是商品及服务税，但一些标准数据结构，如栈，队列，集堆，优先级队列等？序列X将必须进行排序，最大串首先，自然。如果我将其存储在一个字符串数组，我将不得不使用一个排序算法，如归并/快速排序。我们的目标是获得最有效的运行时间尽可能

What if I were to use not a GST, but some standard data structure such as a stack, queue, set, heap, priority queue, etc.? The sequence X would have to be sorted, largest string first, naturally. If I store it in a string array, I will have to use a sorting algorithm such as mergesort/quicksort. The goal is to get the most efficient running time as possible.

我不能存放X系统在结构，因为它建立本身自动排序的元素呢？怎么样一个最大堆？

Can I not store X in a structure that automatically sorts its elements as it builds itself? What about a max-heap?

这似乎像后缀树是找到这种方式的子串的最佳途径。是否有任何其他的数据结构，我可以用？

It would seem like the suffix tree is the best way to find substrings in this fashion. Is there any other data structure I could use?

在字符串中的两个序列最长共同子 [英] Longest common substring in two sequences of strings

问题描述

推荐答案

相关文章

C/C++最新文章

热门教程

热门工具

登录关闭

在字符串中的两个序列最长共同子 [英] Longest common substring in two sequences of strings

问题描述

推荐答案

相关文章

C/C++最新文章

热门教程

热门工具

登录 关闭

登录关闭