后缀数组算法 [英] Suffix Array Algorithm

查看:122
本文介绍了后缀数组算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在相当多的阅读,我已经想通了什么后缀阵列和阵列LCP重presents。

After quite a bit of reading, I have figured out what a suffix array and LCP array represents.

后缀数组:重新presents阵列的每个后缀的_lexicographic等级

Suffix array: Represents the _lexicographic rank of each suffix of an array.

LCP阵列:包含连续两个后缀的最大长度为preFIX的比赛,他们是后的排序字典序

LCP array : Contains the maximum length prefix match between two consecutive suffixes, after they are sorted lexicographically.

我一直在努力,因为几天,究竟在后缀阵列和LCP算法的工作。

I have been trying hard to understand since a couple of days , how exactly the suffix array and LCP algorithm works.

下面是code,这是从 codeforces

Here is the code , which is taken from Codeforces:

/*
Suffix array O(n lg^2 n)
LCP table O(n)
*/
#include <cstdio>
#include <algorithm>
#include <cstring>

using namespace std;

#define REP(i, n) for (int i = 0; i < (int)(n); ++i)

namespace SuffixArray
{
    const int MAXN = 1 << 21;
    char * S;
    int N, gap;
    int sa[MAXN], pos[MAXN], tmp[MAXN], lcp[MAXN];

    bool sufCmp(int i, int j)
    {
        if (pos[i] != pos[j])
            return pos[i] < pos[j];
        i += gap;
        j += gap;
        return (i < N && j < N) ? pos[i] < pos[j] : i > j;
    }

    void buildSA()
    {
        N = strlen(S);
        REP(i, N) sa[i] = i, pos[i] = S[i];
        for (gap = 1;; gap *= 2)
        {
            sort(sa, sa + N, sufCmp);
            REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);
            REP(i, N) pos[sa[i]] = tmp[i];
            if (tmp[N - 1] == N - 1) break;
        }
    }

    void buildLCP()
    {
        for (int i = 0, k = 0; i < N; ++i) if (pos[i] != N - 1)
        {
            for (int j = sa[pos[i] + 1]; S[i + k] == S[j + k];)
            ++k;
            lcp[pos[i]] = k;
            if (k)--k;
        }
    }
} // end namespace SuffixArray

我不能,就是不能熬过这个算法是如何工作的。我试着用铅笔和纸为例工作,并通过所涉及的步骤写了,但失去了联系,因为它太复杂的,至少对我来说。

I cannot, just cannot get through how this algorithm works. I tried working on an example using pencil and paper, and wrote through the steps involved, but lost link in between as its too complicated, for me at least.

对任何解释,用一个例子也许帮助,是非常AP preciated。

Any help regarding explanation, using an example maybe, is highly appreciated.

推荐答案

这是为O(n log n)的算法后缀数组构造(或者更确切地说,这将是,如果不是 ::排序 2通桶排序,已被使用)。

Overview

This is an O(n log n) algorithm for suffix array construction (or rather, it would be, if instead of ::sort a 2-pass bucket sort had been used).

它的工作原理,首先分拣2克(*),然后是4克的原始字符串,那么8克,等等,取值,所以在第i次迭代中,我们排序2 I -grams。可有明显的不超过数 2 (N)这样的迭代多,而且诀窍是,排序2 -grams在第i个步骤是通过促进肯定的是,每个比较两个2 -grams在O(1)时间(而不是O(2 )的时间)。完成

It works by first sorting the 2-grams(*), then the 4-grams, then the 8-grams, and so forth, of the original string S, so in the i-th iteration, we sort the 2i-grams. There can obviously be no more than log2(n) such iterations, and the trick is that sorting the 2i-grams in the i-th step is facilitated by making sure that each comparison of two 2i-grams is done in O(1) time (rather than O(2i) time).

它是如何做到这一点?好了,在第一次迭代它排序2克(又名双字母组),然后进行所谓的词典重命名的。这意味着它会创建一个新的数组(长度 N )的商店,每个两字组,它的排名的在两字组排序。

How does it do this? Well, in the first iteration it sorts the 2-grams (aka bigrams), and then performs what is called lexicographic renaming. This means it creates a new array (of length n) that stores, for each bigram, its rank in the bigram sorting.

示例的词典重命名:说我们有一个排序一些双字母组列表 {'AB','AB','CA', CD,CD,EA} 。然后,我们指定的的(即词典名),由左到右,从0级去和递增排名每当我们遇到的的两字组的变化。那么,我们分配的行列如下:

Example for lexicographic renaming: Say we have a sorted list of some bigrams {'ab','ab','ca','cd','cd','ea'}. We then assign ranks (i.e. lexicographic names) by going from left to right, starting with rank 0 and incrementing the rank whenever we encounter a new bigram changes. So the ranks we assign are as follows:

ab : 0
ab : 0   [no change to previous]
ca : 1   [increment because different from previous]
cd : 2   [increment because different from previous]
cd : 2   [no change to previous]
ea : 3   [increment because different from previous]

这些队伍被称为的字典名称的。

现在,的下一次迭代,我们的排序4克。这涉及到很多不同的4克之间的比较。我们如何比较两个4克?好了,我们可以通过人物的性格比较。这将是每比较高达4的操作。但是相反,我们通过的比较他们查找的包含在其中两个双字母组,使用在previous步骤生成的等级表的行列。这一级别重新presents从previous 2克重排序的字典等级,因此,如果对任意给定的4克,它的第一个2-gram具有较高的排名比第2克的另外4克,那么它一定是按字典顺序更大的介于前两个字符的。因此,如果第一个2克用于两个4克的秩是相同的,它们必须在前两个字符是相同的的。换句话说,两个查找起坐的中等级表足以比较的两个4克所有4个字符。

Now, in the next iteration, we sort 4-grams. This involves a lot of comparisons between different 4-grams. How do we compare two 4-grams? Well, we could compare them character by character. That would be up to 4 operations per comparison. But instead, we compare them by looking up the ranks of the two bigrams contained in them, using the rank table generated in the previous steps. That rank represents the lexicographic rank from the previous 2-gram sort, so if for any given 4-gram, its first 2-gram has a higher rank than the first 2-gram of another 4-gram, then it must be lexicographically greater somewhere in the first two characters. Hence, if for two 4-grams the rank of the first 2-gram is identical, they must be identical in the first two characters. In other words, two look-ups in the rank table are sufficient to compare all 4 characters of the two 4-grams.

整理后,我们再次创造新的字典的名字,此时为4克。

After sorting, we create new lexicographic names again, this time for the 4-grams.

在第三次迭代中,我们需要通过8克进行排序。再次,从previous第二步查找窗口中的字典排​​名表足以比较所有8个字符的两个给定的8克。

In the third iteration, we need to sort by 8-grams. Again, two look-ups in the lexicographic rank table from the previous step are sufficient to compare all 8 characters of two given 8-grams.

等等。每次迭代有两个步骤:

And so forth. Each iteration i has two steps:

  1. 通过分类2 -grams,使用词典名称从previous循环,使2个步骤比较(即O(1)时间)每个

  1. Sorting by 2i-grams, using the lexicographic names from the previous iteration to enable comparisons in 2 steps (i.e. O(1) time) each

创建新的字典的名字

我们重复,直到所有的2 -grams不同。如果出现这种情况,我们正在做。我们怎么知道,如果所有的有什么不同?好了,该字典的名字是整数递增序列,从0开始。因此,如果在一个迭代中产生的最高词典的名字是一样的 N-1 ,然后每次2 -gram必须被赋予它自己的,独特的字典名称。

We repeat this until all 2i-grams are different. If that happens, we are done. How do we know if all are different? Well, the lexicographic names are an increasing sequence of integers, starting with 0. So if the highest lexicographic name generated in an iteration is the same as n-1, then each 2i-gram must have been given its own, distinct lexicographic name.

现在让我们来看看在code证实这一切。是使用的变量如下: SA [] 为后缀阵列我们正在建设。 POS [] 是等级查找表(即它包含了字典的名称),具体而言, POS [K] 包含 k的字典名称 -th M-克previous步骤。 TMP [] 是用来帮助创建一个辅助阵列 POS []

Now let's look at the code to confirm all of this. The variables used are as follows: sa[] is the suffix array we are building. pos[] is the rank lookup-table (i.e. it contains the lexicographic names), specifically, pos[k] contains the lexicographic name of the k-th m-gram of the previous step. tmp[] is an auxiliary array used to help create pos[].

我给了code线之间进一步的解释:

I'll give further explanations between the code lines:

void buildSA()
{
    N = strlen(S);

    /* This is a loop that initializes sa[] and pos[].
       For sa[] we assume the order the suffixes have
       in the given string. For pos[] we set the lexicographic
       rank of each 1-gram using the characters themselves.
       That makes sense, right? */
    REP(i, N) sa[i] = i, pos[i] = S[i];

    /* Gap is the length of the m-gram in each step, divided by 2.
       We start with 2-grams, so gap is 1 initially. It then increases
       to 2, 4, 8 and so on. */
    for (gap = 1;; gap *= 2)
    {
        /* We sort by (gap*2)-grams: */
        sort(sa, sa + N, sufCmp);

        /* We compute the lexicographic rank of each m-gram
           that we have sorted above. Notice how the rank is computed
           by comparing each n-gram at position i with its
           neighbor at i+1. If they are identical, the comparison
           yields 0, so the rank does not increase. Otherwise the
           comparison yields 1, so the rank increases by 1. */
        REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);

        /* tmp contains the rank by position. Now we map this
           into pos, so that in the next step we can look it
           up per m-gram, rather than by position. */
        REP(i, N) pos[sa[i]] = tmp[i];

        /* If the largest lexicographic name generated is
           n-1, we are finished, because this means all
           m-grams must have been different. */
        if (tmp[N - 1] == N - 1) break;
    }
}

关于比较功能

功能 sufCmp 用来比较两个(2 *间隙)-grams字典顺序。因此在第一次迭代将比较中bigram,在第二次迭代4克,然后8克等。这是由缺口,这是一个全局变量来控制。

The function sufCmp is used to compare two (2*gap)-grams lexicographically. So in the first iteration it compares bigrams, in the second iteration 4-grams, then 8-grams and so on. This is controlled by gap, which is a global variable.

一个天真的实施 sufCmp 将是这样的:

A naive implementation of sufCmp would be this:

bool sufCmp(int i, int j)
{
  int pos_i = sa[i];
  int pos_j = sa[j];

  int end_i = pos_i + 2*gap;
  int end_j = pos_j + 2*gap;
  if (end_i > N)
    end_i = N;
  if (end_j > N)
    end_j = N;

  while (i < end_i && j < end_j)
  {
    if (S[pos_i] != S[pos_j])
      return S[pos_i] < S[pos_j];
    pos_i += 1;
    pos_j += 1;
  }
  return (pos_i < N && pos_j < N) ? S[pos_i] < S[pos_j] : pos_i > pos_j;
}

此将比较(2 *间隙)-gram在第i个后缀的开始 pos_i:= SA [i]于与所述一个在开始时出现第j个后缀 pos_j:= SA [J] 。而且,它还将字符一个字符,即比较 S [pos_i] 比较它们与 S [pos_j] ,然后 S [pos_i + 1] S [pos_j + 1] 等。它继续只要字符是相同的。一旦它们不同,则返回1,如果该字符中的第i个后缀是比在第j个后缀,否则为0小。 (请注意,返回&LT; b 在返回 INT 函数意味着你返回1,如果条件为真,而0如果是假的。)

This would compare the (2*gap)-gram at the beginning of the i-th suffix pos_i:=sa[i] with the one found at the beginning of the j-th suffix pos_j:=sa[j]. And it would compare them character by character, i.e. comparing S[pos_i] with S[pos_j], then S[pos_i+1] with S[pos_j+1] and so on. It continues as long as the characters are identical. Once they differ, it returns 1 if the character in the i-th suffix is smaller than the one in the j-th suffix, 0 otherwise. (Note that return a<b in a function returning int means you return 1 if the condition is true, and 0 if it is false.)

在返回语句处理了的(2 *缺口)一种-grams位于字符串末尾的情况下,复杂的寻找状态。在这种情况下,无论是 pos_i pos_j 将达到 N 前所有(2 *间隙)字符被比较,即使到这一点的所有字符是相同的。然后,它将返回1,如果第i个后缀为末,和0,如果第j个后缀是在末端。这是正确的,因为如果所有字符都是相同的,的一个是字典序小。如果 pos_i 已经结束,第i个后缀必须比第j个后缀短。

The complicated looking condition in the return-statement deals with the case that one of the (2*gap)-grams is located at the end of the string. In this case either pos_i or pos_j will reach N before all (2*gap) characters have been compared, even if all characters up to that point are identical. It will then return 1 if the i-th suffix is at the end, and 0 if the j-th suffix is at the end. This is correct because if all characters are identical, the shorter one is lexicographically smaller. If pos_i has reached the end, the i-th suffix must be shorter than the j-th suffix.

显然,这自然实现为O(间隙),即它的复杂度是线性的(2 *间隙)的长度-grams。在code所用的功能,但是,使用字典式名字,使此下降到O(1)(特别是,向下到最多两个比较):

Clearly, this naive implementation is O(gap), i.e. its complexity is linear in the length of the (2*gap)-grams. The function used in your code, however, uses the lexicographic names to bring this down to O(1) (specifically, down to a maximum of two comparisons):

bool sufCmp(int i, int j)
{
  if (pos[i] != pos[j])
    return pos[i] < pos[j];
  i += gap;
  j += gap;
  return (i < N && j < N) ? pos[i] < pos[j] : i > j;
}

正如你所看到的,而不是仰视单个字符 S [I] S [J] ,我们检查第i和j个后缀的字典军衔。辞书的队伍被计算在previous迭代间隙克。所以,如果 POS [1] - ; POS [J] ,那么第i个后缀 SA [I] 必须以间隙克是字典序比gap-小克在开始SA [J] 。换句话说,只要通过查找 POS [I] POS [J] 和比较他们,我们比较第一次的缺口的两个后缀的字符。

As you can see, instead of looking up individual characters S[i] and S[j], we check the lexicographic rank of the i-th and j-th suffix. Lexicographic ranks were computed in the previous iteration for gap-grams. So, if pos[i] < pos[j], then the i-th suffix sa[i] must start with a gap-gram that is lexicographically smaller than the gap-gram at the beginning of sa[j]. In other words, simply by looking up pos[i] and pos[j] and comparing them, we have compared the first gap characters of the two suffixes.

如果队伍是相同的,我们将继续通过比较 POS [1 +差距] POS [J +差距] 。这是相同的比较下的间隙的的(2 *间隙)字符-grams,即第二半的。如果队伍中有张玉峰再次,两(2 *间隙)-grams是张玉峰,所以我们返回0。否则,我们返回1,如果第i个后缀比第j个后缀,否则为0。

If the ranks are identical, we continue by comparing pos[i+gap] with pos[j+gap]. This is the same as comparing the next gap characters of the (2*gap)-grams, i.e. the second half. If the ranks are indentical again, the two (2*gap)-grams are indentical, so we return 0. Otherwise we return 1 if the i-th suffix is smaller than the j-th suffix, 0 otherwise.

下面的例子说明了算法的运行,并演示了特别的字典名称排序算法中的作用。

The following example illustrates how the algorithm operates, and demonstrates in particular the role of the lexicographic names in the sorting algorithm.

我们要排序的字符串 abcxabcd 。这需要三个迭代来生成后缀数组这一点。在每次迭代中,我将展示取值(串), SA (后缀阵列的当前状态)和 TMP POS ,从而重新present的字典名称。

The string we want to sort is abcxabcd. It takes three iterations to generate the suffix array for this. In each iteration, I'll show S (the string), sa (the current state of the suffix array) and tmp and pos, which represent the lexicographic names.

首先,我们初始化:

S   abcxabcd
sa  01234567
pos abcxabcd

请注意如何在字典的名称,最初再present对unigram的字典军衔,只是相同的字符(即对unigram)本身。

Note how the lexicographic names, which initially represent the lexicographic rank of unigrams, are simply identical to the characters (i.e. the unigrams) themselves.

第一次迭代:

排序 SA ,使用双字母组作为排序标准:

Sorting sa, using bigrams as sorting criterion:

sa  04156273

前两个后缀是0和4,因为这些都是两字'AB'的位置。然后1和5(两字组BC的位置),那么6(两字CD),则2(两字组CX)。那么7(不完全两字组D),则3(两字XA)。显然,位置对应的顺序,只对字符双字母组为主。

The first two suffixes are 0 and 4 because those are the positions of bigram 'ab'. Then 1 and 5 (positions of bigram 'bc'), then 6 (bigram 'cd'), then 2 (bigram 'cx'). then 7 (incomplete bigram 'd'), then 3 (bigram 'xa'). Clearly, the positions correspond to the order, based solely on character bigrams.

生成字典的名称:

tmp 00112345

如所描述的,词典式名称分配作为增加整数。前两个后缀(二者开始两字组AB)得到0,接下来的两个(这两个开始的两字组BC)得到1,则2,3,4,5(每一个不同的两字组)。

As described, lexicographic names are assigned as increasing integers. The first two suffixes (both starting with bigram 'ab') get 0, the next two (both starting with bigram 'bc') get 1, then 2, 3, 4, 5 (each a different bigram).

最后,我们根据 SA 的位置映射此,要获得 POS

Finally, we map this according to the positions in sa, to get pos:

sa  04156273
tmp 00112345
pos 01350124

(方式 POS 生成是这样的:经过 SA 由左到右,并使用入门定义在 POS ,使用 TMP 相应的条目来定义该索引的值。所以<$指数C $ C> POS [0] = 0 , POS [4]:= 0 POS [1] := 1 POS [5]:= 1 POS [6]:= 2 ,等等。该指数来源于 SA ,从值 TMP

(The way pos is generated is this: Go through sa from left to right, and use the entry to define the index in pos. Use the corresponding entry in tmp to define the value for that index. So pos[0]:=0, pos[4]:=0, pos[1]:=1, pos[5]:=1, pos[6]:=2, and so on. The index comes from sa, the value from tmp.)

第二次迭代:

我们的排序 SA 再次,我们再次来看看双字母组从 POS (每个重presents序列的两个的原始字符串的双字母组)。

We sort sa again, and again we look at bigrams from pos (which each represents a sequence of two bigrams of the original string).

sa  04516273

请注意1 5的位置已经交换相比 SA 的previous版本。它曾经是15,现在是51。这是因为两字组在 POS [1] 和两字组在 POS [5] 的previous迭代过程中使用的是相同的(包括 BC )的,但现在两字组在 POS [5] 12 ,而两字组在 POS [1] 13 。所以位置 5 自带的的位置 1 。这是由于一个事实,即字典的名称,现在每个重新原始字符串的present双字母组: POS [5] 重presents BC POS [6] 重presentsCD。于是,他们一起重新present BCD ,而 POS [1] 重presents BC POS [2] 重presents CX ,所以他们一起重新present BCX ,这确实是字典序比 BCD

Notice how the position of 1 5 have switched compared to the previous version of sa. It used to be 15, now it is 51. This is because the bigram at pos[1] and the bigram at pos[5] used to be identical (both bc) in during the previous iteration, but now the bigram at pos[5] is 12, while the bigram at pos[1] is 13. So position 5 comes before position 1. This is due to the fact that the lexicographic names now each represent bigrams of the original string: pos[5] represents bc and pos[6] represents 'cd'. So, together they represent bcd, while pos[1] represents bc and pos[2] represents cx, so together they represent bcx, which is indeed lexicographically greater than bcd.

此外,我们生成字典的名称从左至右筛选 SA 的当前版本和 POS比较造成相应的双字母组

Again, we generate lexicographic names by screening the current version of sa from left to right and comparing the corrsponding bigrams in pos:

tmp 00123456

前两个条目仍相同(均为0),因为在 POS对应的双字母组都是 01 。剩下的就是一个严格递增整​​数序列,因为 POS所有其他双字母组是独特的。

The first two entries are still identical (both 0), because the corresponding bigrams in pos are both 01. The rest is an strictly increasing sequence of integers, because all other bigrams in pos are each unique.

我们执行映射到新的 POS 和以前一样(从 SA 以指数和值从 TMP ):

We perform the mapping to the new pos as before (taking indices from sa and values from tmp):

sa  04516273
tmp 00123456
pos 02460135

第三次迭代:

我们的排序 SA 再次,服用 POS双字母组(一如既往),现在每次重新presents 4中bigram的原单串的序列

We sort sa again, taking bigrams of pos (as always), which now each represents a sequence of 4 bigrams of the orginal string.

sa  40516273

您会发现,现在前两项已切换的位置: 04 已成为 40 。这是因为两字组在 POS [0] 02 ,而那个在 POS [ 4] 01 ,后者显然是字典序小。深层次原因是这两个再present abcx ABCD ,分别为。

You'll notice that now the first two entries have switched positions: 04 has become 40. This is because the bigram at pos[0] is 02 while the one at pos[4] is 01, the latter obviously being lexicographically smaller. The deep reason is that these two represent abcx and abcd, respectively.

生成字典的名字收益率:

Generating lexicographic names yields:

tmp 01234567

他们都是不同的,即最高的是 7 ,这是 N-1 。所以,我们都做了,因为是整理现在基于M-克的都不同。即使我们继续,排序顺序不会改变。

They are all different, i.e. the highest one is 7, which is n-1. So, we are done, because are sorting is now based on m-grams that are all different. Even if we continued, the sorting order would not change.

用于2个排序的算法 -grams在每次迭代似乎是内置的排序(或的std ::排序)。这意味着它是一个比较排序,这需要为O(n log n)的时间,在最坏的情况下,的在每次迭代的。既然有日志n次迭代,在最坏的情况下,这使得它为O(n(log n)的 2 ) - 时间算法。但是,可以通过使用桶排序的两道进行排序,因为我们使用的那种比较(在previous步骤即字典的名称)键,形成一个递增的整数序列。因此,这可以提高到实际为O(n log n)的 - 时间算法后缀排序。

The algorithm used to sort the 2i-grams in each iteration appears to be the built-in sort (or std::sort). This means it's a comparison sort, which takes O(n log n) time in the worst case, in each iteration. Since there are log n iterations in the worst case, this makes it a O(n (log n)2)-time algorithm. However, the sorting could by performed using two passes of bucket sort, since the keys we use for the sort comparison (i.e. the lexicographic names of the previous step), form an increasing integer sequence. So this could be improved to an actual O(n log n)-time algorithm for suffix sorting.

我相信这是原来的算法后缀数组建设的建议在1992年论文曼伯和迈尔斯(<一href="http://scholar.google.com/scholar?hl=ja&q=manber+myers+a+new+method+for+online+string+searches&btnG=&lr="相对=nofollow>在谷歌学术搜索的链接,它应该是第一个击中,它可能有一个链接到PDF那里)。这(在同一时间,但独立的一篇论文贡内特和巴埃萨 - 耶茨)是什么介绍了后缀阵列(也称拍阵列的时间)作为数据结构,有趣的进一步研究。

I believe this is the original algorithm for suffix array construction that was suggested in the 1992-paper by Manber and Myers (link on Google Scholar; it should be the first hit, and it may have a link to a PDF there). This (at the same time, but independently of a paper by Gonnet and Baeza-Yates) was what introduced suffix arrays (also known as pat arrays at the time) as a data structure interesting for further study.

现代算法后缀数组构造是为O(n),因此上述不再可用(至少在理论,最坏情况下的复杂性而言)的最佳算法。

Modern algorithms for suffix array construction are O(n), so the above is no longer the best algorithm available (at least not in terms of theoretical, worst-case complexity).

(*) 2克的我指的是一个序列中的两个的连续的字符的原始字符串。例如,当 S = ABCDE 是字符串,那么 AB BC 光盘是2克的S 。同样, ABCD BCDE 是4克。一般,一个m克(为一个正整数米)是连续字符序列。 1-克也称为unigram进行,2-克被称为双字母组,3-克称为卦。有些人继续tetragrams,五角星等。

(*) By 2-gram I mean a sequence of two consecutive characters of the original string. For example, when S=abcde is the string, then ab, bc, cd, de are the 2-grams of S. Similarly, abcd and bcde are the 4-grams. Generally, an m-gram (for a positive integer m) is a sequence of m consecutive characters. 1-grams are also called unigrams, 2-grams are called bigrams, 3-grams are called trigrams. Some people continue with tetragrams, pentagrams and so on.

注意 S的后缀启动和位置,是的(NI)-gram 取值。此外,每间克(对任何m)为取值的后缀之一的preFIX。因此,分选间 - 克(对于米尽可能大)可以朝排序后缀的第一步。

Note that the suffix of S that starts and position i, is an (n-i)-gram of S. Also, every m-gram (for any m) is a prefix of one of the suffixes of S. Therefore, sorting m-grams (for an m as large as possible) can be the first step towards sorting suffixes.

这篇关于后缀数组算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆