Complete Suffix Array

Question

A suffix array will index all the suffixes for a given list of strings, but what if you're trying to index all the possible unique substrings? I'm a bit new at this, so here's an example of what I mean:

Given the string

abcd

A suffix array indexes (at least to my understanding)

(abcd,bcd,cd,d)

I would like to index (all the substrings)

(abcd,bcd,cd,d,abc,bc,c,ab,b,a)

Is a suffix array what I'm looking for? If so, what do I do to get all the substrings indexed? If not, where should I be looking? Also what would I google for to contrast "all substrings" vs "suffix substrings"?

Recommended answer

The suffix array does what you need already, because every substring is a prefix of one of the suffixes. Specifically, given your suffix array

abcd bcd cd d

and assume you are looking for the substring "bc", then you can find it by looking for all suffixes that start with "bc" (there is only one in this case, "bcd"). Since a suffix array is lexicographically sorted, finding all suffixes that share a certain prefix corresponds to a binary search across the suffix array, and the result will be one contiguous range of entries of the suffix array.
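As a rough illustration of that binary search, here is a minimal Python sketch. The function names `build_suffix_array` and `suffix_range` are made up for this example, and the suffix array is built naively by sorting suffixes, which is fine for short strings but not how a production index would do it.

```python
def build_suffix_array(text):
    # Naive construction (assumption for illustration): sort the start
    # positions of all suffixes by the suffix text itself.
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_range(text, sa, pattern):
    """Return (start, end) so that sa[start:end] lists every suffix
    that begins with pattern. The range is empty if there is no match."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


text = "abcd"
sa = build_suffix_array(text)
start, end = suffix_range(text, sa, "bc")
matches = [text[i:] for i in sa[start:end]]  # suffixes starting with "bc"
```

Truncating each suffix to the pattern length keeps the comparison monotone over the sorted suffixes, which is what makes the two binary searches valid.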

However, there are optimised search methods that combine the suffix array with auxiliary data structures, such as the LCP (longest common prefix) array or wavelet trees. See Navarro's 2007 survey for a description of such methods (DOI 10.1145/1216370.1216372).

To take into account the comments made below, I suggest combining each suffix with the number of substrings it represents. In a simple example like the above this would be

4 abcd
3 bcd
2 cd
1 d

because, for example, the first suffix "abcd" represents the 4 substrings "a", "ab", "abc", "abcd". However, in a more complex example, say for the string "abcabxdabe", the first two entries of the suffix array would be

10 abcabxdabe
1 abe

because the second entry represents substrings "a", "ab" and "abe", but "a" and "ab" are also represented by the first entry.

How to calculate the number of substrings an entry represents? --> The length of the suffix minus the length of the longest prefix it has in common with the previous suffix. E.g. in the "abe" example, that is 3 (its length) minus 2 (the length of "ab", the longest prefix it shares with the previous entry). So these numbers can be generated in one pass over the suffix array, and even faster if you have also generated the LCP (longest-common prefix) array.
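The counting rule can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names `common_prefix_len` and `new_substring_counts` are invented here, the suffixes are sorted naively, and the shared prefix is measured by direct comparison rather than read from a precomputed LCP array.

```python
def common_prefix_len(a, b):
    # Length of the longest prefix shared by strings a and b.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def new_substring_counts(text):
    """For each suffix in sorted order, return (count, suffix), where count
    is the suffix length minus the prefix shared with the previous suffix."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    counts = []
    prev = ""  # the first entry has no predecessor, so nothing is shared
    for s in suffixes:
        counts.append((len(s) - common_prefix_len(prev, s), s))
        prev = s
    return counts


simple = new_substring_counts("abcd")
complex_head = new_substring_counts("abcabxdabe")[:3]
```

Summing the counts gives the number of distinct substrings of the string, since every distinct substring is attributed to exactly one entry.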

The next step would be to generate accumulated counts:

10 abcabxdabe
11 abe
16 abxdabe
...

and then to find an efficient way to make use of the accumulated counts. E.g. if you want to get the 13th substring lexicographically, you'd have to find the first entry that has an accumulated count greater than or equal to 13. That would be "16 abxdabe" above. Then remove the prefix it shares with the previous entry (yields "xdabe"), and then jump to the position after the 2nd character (because the previous entry has accumulated count 11, and 13-11==2), so you get "abxd" as the 13th substring lexicographically.
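One possible way to use the accumulated counts is a binary search over them. The sketch below is an assumption-laden illustration, not the only approach: `kth_substring` and `common_prefix_len` are hypothetical names, the suffixes are sorted naively, and `k` is 1-based to match the "13th substring" wording above.

```python
from bisect import bisect_left
from itertools import accumulate


def common_prefix_len(a, b):
    # Length of the longest prefix shared by strings a and b.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def kth_substring(text, k):
    """Return the k-th (1-based) distinct substring of text in
    lexicographic order, using accumulated per-suffix counts."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    # Prefix length each suffix shares with its predecessor (0 for the first).
    lcps = [0] + [common_prefix_len(a, b) for a, b in zip(suffixes, suffixes[1:])]
    counts = [len(s) - l for s, l in zip(suffixes, lcps)]
    totals = list(accumulate(counts))
    if not totals or not 1 <= k <= totals[-1]:
        raise IndexError("k is out of range")

    # First entry whose accumulated count is >= k.
    idx = bisect_left(totals, k)
    before = totals[idx - 1] if idx > 0 else 0
    # Keep the shared prefix, then extend by the remaining k - before characters.
    return suffixes[idx][:lcps[idx] + (k - before)]


thirteenth = kth_substring("abcabxdabe", 13)
```

For the example string this follows the same steps as the text: the entry "abxdabe" is selected, its shared prefix "ab" is kept, and two more characters are taken because 13 - 11 == 2.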
