最长公共prefixes [英] Longest Common Prefixes

查看:138
本文介绍了最长公共prefixes的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我构建了一个后缀阵列,即整数给人一种字符串字典顺序的所有后缀的起始位置的数组。

Suppose I constructed a suffix array, i.e. an array of integers giving the starting positions of all suffixes of a string in lexicographical order.

例:对于一个字符串海峡= abcabbca

后缀数组是:

suffixArray[] = [7 3 0 4 5 1 6 2]


说明:


Explanation:

i   Suffix      LCP of str and str[i..]   Length of LCP
7   a           a                           1
3   abbca       ab                          2
0   abcabbca    abcabbca                    8
4   bbca        empty string                0
5   bca         empty string                0
1   bcabbca     empty string                0
6   ca          empty string                0
2   cabbca      empty string                0


现在这个 suffixArray 构造,我想找到的<$ C的最长公共preFIX(LCP)在长$ C> STR (字符串本身)和其他每个后缀的。什么是最有效的方式办呢?


Now with this suffixArray constructed, I want to find the length of the Longest Common Prefix (LCP) between str (the string itself) and each of the other suffixes. What is the most efficient way to do it?

推荐答案

根据您的意见,我会认为,我们有机会获得后缀列 SA 以及标准 LCP 阵列,即数据结构,它告诉我们,在索引i> 0,什么后缀的最长的共同preFIX长度 SA [I] 及其字典predecessor ​​ SA [I-1]

Based on your comment, I'll assume we have access to the suffix array SA as well as the standard LCP array, i.e. a data structure that tells us, at index i>0, what the length of the longest common prefix of suffix SA[i] and its lexicographic predecessor SA[i-1] is.

我会用字母L是指我们要构建的特殊LCP阵列,如问题描述。我将使用字母N是指输入字符串的长度 STR

I'll use the letter L to refer to the special LCP array we want to construct, as described in the question. I'll use the letter N to refer to the length of the input string str.

那么我们可以做的是这样的:

Then what we can do is this:

  1. 确定 STR 的后缀数组中的位置。我们可以通过筛选 SA 线性查找条目 0 做到这一点。 (说明: STR STR 开始,因此位置0, 0 <后缀/ code>必须作为后缀数组的一个项目。)

  1. Determine the position of str within the suffix array. We can do this by screening SA linearly to find the entry 0. (Explanation: str is the suffix of str starting at position 0. Therefore, 0 must appear as an entry of the suffix array.)

假设我们发现条目是在索引k。然后,我们可以设置 L [K]:= N ,我们因为 SA [K] 是字符串本身具有在常见的N个字符与本身preFIX。

Suppose the entry we find is at index k. Then we can set L[k]:=N, we because SA[k] is the string itself and has a prefix of N characters in common with itself.

然后,我们可以设置 L [K-1]:= LCP [K] L [K + 1]:= LCP [K + 1] ,因为这是怎样的标准LCP定义。

Then we can set L[k-1]:=LCP[k] and L[k+1]:=LCP[k+1] because that is how the standard LCP is defined.

然后我们倒退从我:= K-2降至0,并设置

Then we go backwards from i:=k-2 down to 0 and set

L[i] := min(LCP[i+1],L[i+1])

这工作,因为在每次迭代我, LCP [I + 1] 告诉我们相邻后缀的最长的共同preFIX SA [I] SA [I + 1] L [I + 1] 告诉我们,previously处理后缀 SA [I + 1] 键,输入字符串 STR <最长共同preFIX / code>。 L [I] 必须是最小的两个,因为 L [I] 表示多久preFIX SA [I] 的共同之处 STR ,并且不能长于preFIX有在共同与 SA [I + 1] ,否则其后缀数组中的位置将接近到k。

This works because, at each iteration i, LCP[i+1] tells us the longest common prefix of the adjacent suffixes SA[i] and SA[i+1], and L[i+1] tells us the longest common prefix of the previously processed suffix SA[i+1] and the input string str. L[i] must be the minimum of those two, because L[i] indicates how long a prefix SA[i] has in common with str, and that cannot be longer than the prefix it has in common with SA[i+1], otherwise its position in the suffix array would be closer to k.

我们也算着从我:= K + 2 N和设置

We also count forward from i:=k+2 to N and set

L[i] := min(LCP[i],L[i-1])

根据同样的道理。

随后的所有N值被设定,前后历时不超过O(n)的时间,假设随机访问数组和整数比较为O (1),分别

Then all N values of L have been set, and it took no more than O(N) time, assuming that random access to the arrays and integer comparison are O(1), respectively.

由于我们计算的长度为N个条目,一对O(N)的复杂阵列是最优的。

Since the array we compute is N entries in length, a complexity of O(N) is optimal.

(请注意,您可以在K-1和K + 1开始,在步骤4和5环,分别与摆脱步骤3中的额外的步骤只是为了使解释的 - 希望 - - 一个小更容易执行)

这篇关于最长公共prefixes的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆