找出是否存在两个相邻的相同子字符串 [英] Finding out whether there exist two identical substrings one next to another
问题描述
我们有一个字符串.
ABAEABABEABE
ABAEABABEABE
现在,我们必须检查是否存在一个子字符串,其后跟着另一个与第一个子字符串完全相同的子字符串.
Now we've got to check whether there exist a substring that is followed next by another substring that's exactly the same as the first one.
在此示例中:
ABAEAB ABE ABE
ABE后跟ABE,这是两个相同的子字符串.
In this example:
ABAEABABEABE
ABE is followed by ABE and that are two identical substrings.
在此示例中:
AAB
只是A,因为A后面跟着另一个A.
在此示例中:
ABCDEFGHIJKLMNO
没有这样的子字符串,因此答案为否.
In this example:
AAB
It would be simply A, beacuse A is followed by another A.
In this example:
ABCDEFGHIJKLMNO
There doesn't exist such a substring, so the answer would be NO.
我只设法找到了可以在O(n ^ 2)中运行的算法.那就是哈希值及其前缀.然后,对于每个字母,我们简单地展开并检查所有以该字母结尾的单词.有n个字母.我们需要将其扩展n次.所以是O(n ^ 2).我相信应该有一个O(n log n)算法来解决这个问题.
I only managed to find an algorithm that would run in O(n^2). That is getting Hashes and its prefixes. Then for each letter we expand simply and check all the words ending on that letter. There are n letters. We need to expand it n times. So it's O(n^2). I believe there should be a O(n log n) algorithm for this problem.
有人有更好的主意吗?
推荐答案
我想您希望遵循此模式的最长子字符串.
I guess you want the longest substring possible that follows this pattern.
要做的第一件事是构建输入字符串的后缀树.使用 Ukkonen算法,它是 O(n).
The first thing to do is to build a Suffix tree of the input string. Using Ukkonen's algorithm, this is O(n).
现在,您提供的条件如何在后缀树中转换?首先,您正在寻找一个重复的子字符串 [1] . 重复的子字符串将显示为后缀树的内部节点. 后缀中的最大节点数由 n -char字符串构建的树为 2n-1 .
Now, how does the condition you provided translate in the suffix tree? First things first, you are looking for a repeated substring [1]. Repeated substrings will appear as internal nodes of the suffix tree. The maximum number of nodes in a suffix tree built from a n-char string is 2n - 1.
您可以使用它们的长度(字符数)构建包含此类重复子字符串的Max-Heap. 您不不要让子字符串的长度超过 N/2 (请参阅[1]).这是 O(N),其中 N 是后缀树的内部节点数.对于任何后缀树:
You can build a Max-Heap containing such repeated substrings, using their length (number of chars). You do NOT keep substrings of length superior to N/2 (see [1]). This is O(N) where N is the number of internal nodes of the suffix tree. For any suffix tree:
0≤ N ≤ n-2
现在,您将max从优先级队列中取出,并处理获得的内部节点 i :
Now, you take the max out of the priority queue and process the internal node i you obtained:
- 让 S i 是与 i , k = 0和 curnode = i
- k <长度( S i )
- Let Si be the substring related to i, k = 0 and curnode = i
- While k < length(Si)
- 如果从 i 到 i 的子代的键等于 S i [k] ,然后 k = k + 1
- 否则会打破循环.
- If the key from i to a child of i is equal to Si[k], then k = k+1
- Else break the loop.
复杂性摘要
让 n 为查询字符串的长度.
Complexity summary
Let n be the length of the query string.
- 构建后缀树: O(n)
- 建立重复子串的最大堆: [3]
- 识别重复的子字符串(即内部节点)并将其存储在数组中: O(n)
- 堆数组: O(n)
- Building the suffix tree : O(n)
- Building the Max-heap of repeated substrings: [3]
- Identifying the repeated substrings (ie. internal nodes) and storing them in an array: O(n)
- Heapify the array: O(n)
因此,最坏情况的总体复杂度是上述总和,并且是 O(n².log(n)).
Hence the overall worst case complexity is the sum of the above and is O(n².log(n)).
我在上面做了算法...因此它不是次优的,如果您足够勇敢,则可以遍历本文描述了线性时间算法!无论如何,后缀树是解决此问题的关键,因此我建议您仔细研究它们.
I made the algorithm above... Hence it is suboptimal, if you are brave enough, you can go through this paper that describes a linear time algorithm! In any case, Suffix trees are a key to this problem so I suggest you study them thoroughly.
[1] :警告,重复的子字符串可能会部分重叠!
[1]: Warning, repeated substrings may partially overlap!
[2] :实际上,最坏情况下的复杂性要比这个非常幼稚的上限要好,但是我不知道如何证明这一点(还好吗?!).例如,如果存在 n-2 个内部节点,则意味着原始字符串由出现了相同字符的 n 个组成.在这种情况下,我们检查的第一个子字符串是match =>,它是O(n.log(n)).
[2]: Actually, the worst case complexity is better than this very naive upper bound but I don't know how to prove it (yet?!). For example, if there were n - 2 internal nodes, that would mean that the original string consists of n occurrences of the same character. In that case, the first substring we check is a match => it's O(n.log(n)).
[3] :如果我们用常规排序方式( O(n.log(n)))替换堆构造,则最后的比较步骤将在 O(n²)而不是 O(n².log(n)) ...降低了 O(n.log(n))(由于排序步骤)和 O(n²).
[3]: If we replace the heap construction by a regular sort (O(n.log(n))), the final comparison step runs in O(n²) instead of O(n².log(n)) ... Taking down the overall complexity between O(n.log(n)) (due to the sorting step) and O(n²).
这篇关于找出是否存在两个相邻的相同子字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!