在字符串中的两个序列最长共同子 [英] Longest common substring in two sequences of strings

查看:222
本文介绍了在字符串中的两个序列最长共同子的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

说完刚刚学会的最长公共子串的算法,我很好奇这个问题的特定变体。它被描述如下: - :

Having just learned the longest common substring algorithm, I was curious about a particular variant of the problem. It is described as follows -:

给定两个非空字符串的序列中,X =(X1,X2,X3,...,X(N))和Y =(Y1,Y2,Y3,...,γ(米)) ,其中,x(i)和Y(I)是字符的字符串,找到最长串x,它是所有在Y串子串。

Given two non-empty sequences of strings, X = (x1, x2, x3,....,x(n)) and Y = (y1, y2, y3,..., y(m)), where x(i) and y(i) are strings of characters, find the longest string in X which is a substring of all the strings of Y.

我有一个函数子(X,Y)返回布尔描绘x是否是y或不是一个字符串。显然,我必须串联Y中所有的字符串,形成一个大的字符串,比如,记为B.我认为以下途径 - :

I have a function substring(x, y) which returns booleans depicting whether x is a substring in y or not. Evidently, I have to concatenate all the strings in Y to form one big string, say, denoted by B. I thought of the following approaches -:

  • 朴素:首先连接所有串x中,形成一个字符串A(N)。应用子(A(N),B) - 这包括迭代落后的字符串A(N)。如果为真,该算法在这里结束并返回A(N) - 或它的任何部分包括在所述子。如果没有,请继续申请(A(N - 1),B)等。如果这样的字符串无X存在,我返回空字符串。
  • Naive: Start by concatenating all strings in X to form a string A(n). Apply substring(A(n), B) - this includes iterating backward in string A(n). If true, the algorithm ends here and returns A(n) - or whatever portion of it is included in said substring. If not, proceed to apply (A(n - 1), B) and so on. If such a string does not exist in X, I return the empty string.

显然,这种方法会占用相当长的运行时间取决于执行。假设我用迭代的方法,在每次迭代我将不得不向后遍历字符串在该水平/指数,并随后申请子()。这将需要ATLEAST两个环,和 O(尺寸(B)*最大长度(X1,X2,...))最坏情况下的时间,或更大,取决于子() (纠正我,如果错了)。

Obviously this approach would take up quite some running time depending on the implementation. Assuming I use an iterative approach, on each iteration I would have to iterate backward through the String at that level/index, and subsequently apply substring(). It would take atleast two loops, and O(size(B) * maxlength(x1, x2,...)) worst case time, or more depending on substring() (correct me if wrong).

我基于后缀树/阵列想到了第二种方法。

I thought of a second approach based on suffix trees/arrays.

  • 广义后缀树:我建立序列Y的使用Ukkonen的算法,商品及服务税的 O(最大长度(Y1,Y2,...) (?)。我缺乏后缀树知识的叮咬。我相信一个后缀树的方法将大大减少运行时间(以成本的空间)查找子串,但我不知道如何执行该操作。
  • Generalized Suffix Tree: I build a GST of sequence Y using Ukkonen's algorithm in O(maxlength(y1, y2,...)(?). My lack of knowledge of suffix trees bites. I believe a suffix tree approach would substantially reduce running time (at the cost of space) for finding the substring, but I have no idea how to implement the operation.

如果有更好的办法,我很想知道。

If there is a better approach, I'd love to know.

编辑:如果这似乎是我抛弃了专题道歉

Apologies if it seemed like I abandoned the topic.

如果我要使用不是商品及服务税,但一些标准数据结构,如栈,队列,集堆,优先级队列等?序列X将必须进行排序,最大串首先,自然。如果我将其存储在一个字符串数组,我将不得不使用一个排序算法,如归并/快速排序。我们的目标是获得最有效的运行时间尽可能

What if I were to use not a GST, but some standard data structure such as a stack, queue, set, heap, priority queue, etc.? The sequence X would have to be sorted, largest string first, naturally. If I store it in a string array, I will have to use a sorting algorithm such as mergesort/quicksort. The goal is to get the most efficient running time as possible.

我不能存放X系统在结构,因为它建立本身自动排序的元素呢?怎么样一个最大堆?

Can I not store X in a structure that automatically sorts its elements as it builds itself? What about a max-heap?

这似乎像后缀树是找到这种方式的子串的最佳途径。是否有任何其他的数据结构,我可以用?

It would seem like the suffix tree is the best way to find substrings in this fashion. Is there any other data structure I could use?

推荐答案

书写| Y |对于串的集合中的Y,和len(Y),对于它们的总长度的数目:

Writing |Y| for the number of strings in the set Y, and len(Y) for their total length:

  1. 工艺Y中的字符串成广义后缀树(例如,使用 Ukkonen算法)。需要时间O(LEN(Y)),假设一个固定尺寸的字母。

  1. Process the strings in Y into a generalized suffix tree (for example, using Ukkonen's algorithm). Takes time O(len(Y)), assuming a constant-size alphabet.

标记中,根据确定该节点的字符串是否属于在Y的所有字符串的后缀树的每个节点需要时间O(| Y | LEN(Y))。

Mark each node in the suffix tree according to whether the string identified by that node belongs to all the strings in Y. Takes time O(|Y| len(Y)).

有关在X中每个字符串,看它的后缀树,看看节点被标记为属于所有的字符串Y.输出上最长的标记字符串。需要时间O(LEN(X))。

For each string in X, look it up in the suffix tree and see if the node is marked as belonging to all the strings in Y. Output the longest such marked string. Takes time O(len(X)).

总时间:O(| Y | LEN(Y))+ O(LEN(X))

Total time: O(|Y| len(Y)) + O(len(X)).

这篇关于在字符串中的两个序列最长共同子的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆