后缀阵列算法 [英] Suffix Array Algorithm

查看：244 发布时间：2017/4/3 11:19:37 c++ algorithm data-structures suffix-array

本文介绍了后缀阵列算法的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

后缀数组：表示一个后缀数组和LCP数组

LCP数组：包含两个连续后缀之间的最大长度前缀匹配，后面是按字典顺序排序。

几天后，我一直在努力理解，后缀数组和LCP算法究竟如何工作。 / strong>

以下代码来自 Codeforces ：

  / * 
后缀数组O（n lg ^ 2 n）
 LCP表O （n）
 * / 
 #include< cstdio> 
 #include< algorithm> 
 #include< cstring> 
 
使用namespace std; 
 
 #define REP（i，n）for（int i = 0; i<（int）（n）; ++ i）
 
命名空间SuffixArray 
 {
 const int MAXN = 1< 21; 
 char * S; 
 int N，gap; 
 int sa [MAXN]，pos [MAXN]，tmp [MAXN]，lcp [MAXN]; 
 
 bool sufCmp（int i，int j）
 {
 if（pos [i]！= pos [j]）
 return pos [i] pos [j]; 
 i + = gap; 
 j + = gap; 
 return（i< N&& j< N）？ pos [i] pos [j]：i> j 
} 
 
 void buildSA（）
 {
 N = strlen（S）; 
 REP（i，N）sa [i] = i，pos [i] = S [i]; 
 for（gap = 1 ;; gap * = 2）
 {
 sort（sa，sa + N，sufCmp）; 
 REP（i，N  -  1）tmp [i + 1] = tmp [i] + sufCmp（sa [i]，sa [i + 1]）; 
 REP（i，N）pos [sa [i]] = tmp [i]; 
 if（tmp [N  -  1] == N  -  1）break; 
} 
} 
 
 void buildLCP（）
 {
 for（int i = 0，k = 0; i< N; ++ i ）if（pos [i]！= N-1）
 {
 for（int j = sa [pos [i] + 1]; S [i + k] == S [j + ];）
 ++ k; 
 lcp [pos [i]] = k; 
 if（k）-k; 
} 
} 
} // end namespace SuffixArray

不能，只是无法通过这个算法如何工作。我试着用铅笔和纸张做一个例子，并通过所涉及的步骤写下来，但是至少在我之间丢失了链接，因为它太复杂了。对于我来说，至少我是

任何帮助关于解释，使用一个例子可能是高度赞赏。

解决方案

概述

这是一个O（n log n）后缀数组构造的算法（或者说，如果不是 :: sort ，则使用2遍桶类型）。

它首先对2克（*）进行排序，然后将原始字符串的$ 4，然后是8克等等。 c $ c> S ，所以在第i次迭代中，我们排序2 ⁱ -grams。显然可以不存在这样的迭代的log ₂（n），诀窍在于通过使第i个步骤中的2 ⁱ确保在O（1）时间（而不是O（2 ⁱ）时间）中完成两个2 ⁱ -grams的每个比较。

它如何做？那么，在第一次迭代中它会排列2克（又称双字号），然后执行所谓的字典重命名。这意味着它会创建一个新的数组（长度 n ），它可以在二进制排序中为每个二进制格式存储其等级。

词典重命名示例：说我们有一些排序列表，一些二进制格式 {'ab' 'ab'，'ca'，'cd'，'cd'，'ea'} 。然后，我们从左到右分配排名（即词典名称），从等级0开始，并且当我们遇到新的二进制变化时递增排名。所以我们分配的行列如下：

  ab：0 
 ab：0 [没有更改为上一个] 
 ca：1 [因为与以前不同而增加] 
 cd：2 [由于与以前不同而增加] 
 cd：2 [不改变到以前] 
 ea：3 [增加因为与以前不同]

这些排名称为词典名称。 p>

现在，在下一次迭代中，我们排序4克。这涉及不同4克之间的大量比较。我们如何比较两个4克？那么我们可以比较他们的性格。每比较最多4个操作。相反，我们使用前面步骤中生成的排名表来比较他们中包含的两个二元组的排名。这个等级代表了以前的2克排序的词典排名，所以如果对于任何给定的4克，它的第一个2克比另外4克的前2克更高，那么它必须是词典上的更大的在前两个字符的某个地方。因此，如果对于两个4克，则第一个2克的等级是相同的，它们在前两个字符中必须相同。换句话说，排名表中的两个查询足以比较两个4克的所有4个字符。

排序后，我们再次创建新的词典名称，这次是4克。

在第三次迭代中，我们需要排序8克。再次，上一步的词典排名表中的两个查询足以比较两个给定8克的所有8个字符。

等等。每次迭代 i 有两个步骤：

按2排序> i -grams，使用上一次迭代中的词典名称，可以在两个步骤（即O（1）时间）中进行比较，每个

创建新的词典名称

我们重复一遍，直到所有2 ⁱ -grams是不同的。如果发生这种情况，我们就完成了。我们怎么知道一切都不同？那么，词典名称是从0开始的整数的渐增序列。因此，如果在迭代中产生的最高词典名称与 n-1 相同，那么每个2必须具有自己独特的词典名称。

实施

现在我们来看看代码以确认所有这一切。使用的变量如下： sa [] 是我们正在构建的后缀数组。 pos [] 是排名查找表（即它包含词典名称），具体来说， pos [k] 包含上一步骤的 k 第m个单词的词典名称。 tmp [] 是用于帮助创建 pos [] 的辅助数组。

我将在代码行之间进一步说明：
void buildSA（） { N = strlen（S）; / *这是一个初始化sa []和pos []的循环。对于sa []，我们假设给定字符串中后缀为的顺序。对于pos []，我们使用字符本身设置每个1克的词典等级。这是有道理的吧？ * / REP（i，N）sa [i] = i，pos [i] = S [i]; / * Gap是每步中m-gram的长度，除以2. 我们以2克开始，所以gap最初为1。然后将增加到2,4,8等。 * / for（gap = 1 ;; gap * = 2） { / *我们排序（gap * 2）-grams：* / sort（sa， sa + N，sufCmp）; / *我们计算我们在上面排序的每个m-gram 的词典排名。通过将i位置的每个n-gram与i + 1的邻居进行比较，注意排名如何计算。如果它们相同，则比较产生0，因此排名不增加。否则比较产生1，所以等级增加1. * / REP（i，N - 1）tmp [i + 1] = tmp [i] + sufCmp（sa [i] sa [i + 1]）; / * tmp按位置包含排名。现在我们将这个映射到pos中，以便在下一步中，我们可以看到每m克的，而不是通过位置。 * / REP（i，N）pos [sa [i]] = tmp [i]; / *如果生成的最大字典名称是 n-1，我们完成了，因为这意味着所有 m-gram必须是不同的。 * / if（tmp [N - 1] == N - 1）break; } }
关于比较功能

功能 sufCmp 用于按字典顺序比较两个（2 *间隙）。所以在第一次迭代中，比较二进制，在第二次迭代中，4克，然后是8克等等。这是一个全局变量 gap 控制。

sufCmp 将是这样的：
bool sufCmp（int i，int j） { int pos_i = sa [i]; int pos_j = sa [j]; int end_i = pos_i + 2 * gap; int end_j = pos_j + 2 * gap; if（end_i> N） end_i = N; if（end_j> N） end_j = N; while（i< end_i&& j< end_j） { if（S [pos_i]！= S [pos_j]） return S [pos_i] S [pos_j]; pos_i + = 1; pos_j + = 1; } return（pos_i< N&& pos_j< N）？ S [pos_i] S [pos_j]：pos_i> pos_j }
这将比较开头的（2 * gap）第i个后缀 pos_i：= sa [i] 与第j个后缀 pos_j：= sa [j] 。它会比较它们的字符，即将 S [pos_i] 与 S [pos_j] ，然后 S [pos_i + 1] 与 S [pos_j + 1] 等等。只要字符相同，它就会继续。一旦它们不同，如果第i个后缀中的字符小于第j个后缀中的字符，则返回1，否则返回0。（请注意，在返回 int的函数中返回一个< b 意味着如果条件为真，则返回1，而0如果它是假的）。

返回语句中复杂的查找条件涉及（2 *间隙）图中的一个位于结尾的情况的字符串。在这种情况下， pos_i 或 pos_j 将达到 N 所有（2 *间隙）字符已被比较，即使到此为止的所有字符都相同。如果第i个后缀为末尾，则返回1，如果第j个后缀为末尾，则返回0。这是正确的，因为如果所有字符都相同，则较短的一个字典字体较小。如果 pos_i 已经结束，则第i个后缀必须比第j个后缀短。

显然，这个天真的实现是O（gap），即其复杂度在（2 * gap） - 图的长度上是线性的。然而，您的代码中使用的函数使用词典名称将其归结为O（1）（具体而言，最多可达两次比较）：
bool sufCmp（int i，int j） { if（pos [i]！= pos [j]） return pos [i] pos [j]; i + = gap; j + = gap; return（i< N&& j< N）？ pos [i] pos [j]：i> j }

正如你所看到的，而不是查找单个角色 S [i] 和 S [j] ，我们检查第i个和第j个后缀的词典排名。在以前的间隙克数迭代中计算了词典排名。所以，如果 pos [i] pos [j] ，那么第i个后缀 sa [i] 必须以一个间隔字符开头，克开始 sa [j] 。换句话说，通过查看 pos [i] 和 pos [j] 并比较它们，我们比较了两个后缀的第一个字符。

如果排名相同，我们继续比较 pos [i +间隙] 与 pos [j + gap] 。这与比较（2 *间隙）图形的下一个字符，即后半部分相同。如果这个等级再次是缩进的，那么两个（2 *间隙）图是不相关的，所以我们返回0.否则，如果第i个后缀小于第j个后缀，则返回1，否则返回0。

示例

以下示例说明了算法的运作方式，特别是排序算法中词典名称的作用。

我们要排序的字符串是 abcxabcd 。为此生成后缀数组需要三次迭代。在每次迭代中，我将显示 S （字符串）， sa （后缀数组的当前状态）和 tmp 和 pos ，代表词典名称。

首先，我们初始化：

  S abcxabcd 
 sa 01234567 
 pos abcxabcd

请注意，最初代表单字的词典排名的词典名称与字符（即单字）完全相同，

第一次迭代

排序 sa ，使用bigrams作为排序标准：

  sa 04156273

前两个后缀为0和4，因为这些是二进制ab的位置。然后1和5（bigram'bc'的位置），然后是6（bigram'cd'），然后是2（bigram'cx'）。然后7（不完整的二元组'd），然后3（bigram'xa'）。显然，这些职位对应于仅依赖于字符组合的顺序。

生成词典名称：

  tmp 00112345

如上所述，词典名称分配为增加整数。前两个后缀（两者都以二进制'ab'开头）得到0，接下来的两个（两者都以bigram'bc'开头）得到1，然后是2,3,4,5（每个不同的二进制）。

最后，我们根据 sa 中的位置对此进行映射，以获取 pos ：

  sa 04156273 
 tmp 00112345 
 pos 01350124

（ pos 的生成方式是：通过 sa 从左到右，使用条目在 pos 中定义索引，使用 tmp 定义该索引的值所以 pos [0]：= 0 ， pos [4]：= 0 ， pos [1]：= 1 ， pos [5]：= 1 code> pos [6]：= 2 等等。索引来自 sa ，值来自 tmp 。）

第二次迭代

再次对 sa 进行排序，再次从 pos （每个代表）对原始字符串的两个二进制的序列表示反感。

  sa 04516273

注意1 5的位置与先前版本的 sa 。以前是15岁，现在是51岁。这是因为 pos [1] 和 pos [5] 在上一次迭代中曾经是相同的（ bc ），但现在在$ code> pos [5] 是 12 ，而 pos [1] 中的bigram是 13 。所以位置 5 在位置 1 之前。这是因为字典名称现在各自代表原始字符串的双字母： pos [5] 表示 bc 和 pos [6] 表示'cd'。所以，他们一起代表 bcd ，而 pos [1] 表示 bc 和 pos [2] 表示 cx ，所以一起它们代表 bcx ，其确实是字典大于 bcd 。

 
 
 再次，我们通过筛选生成词典名称 c code code code code code code code code $ c 
 $ b 
  tmp 00123456 
  
两个条目仍然相同（两者都是0），因为 pos 中的相应的二进制文件都是 01 。其余的是一个严格增加的整数序列，因为 pos 中的所有其他二进制码都是唯一的。
 
 
 我们执行如前所述映射到新的 pos （从 sa 中的索引和 tmp ）：
  sa 04516273 
 tmp 00123456 
 pos 02460135 
  
 第三次迭代 
 
 
 我们再次对 sa 进行排序，采用 pos （一如既往）的二进制，现在每个代表一个4个二进制的序列原始字符串。
  sa 40516273 
  
你会注意到，现在前两个条目已经切换职位： 04 已经成为 40 。这是因为 pos [0] 中的bigram是 02 而 pos [ 4] 是 01 ，后者显然在字典上较小。最重要的原因是这两个代表分别代表 abcx 和 abcd 。
 
 
 生成词典名称产生：
  tmp 01234567 
  
它们都不同，即最高的一个是 7 ，这是 n- 1 。所以，我们完成了，因为现在排序是基于m-gram都是不同的。即使我们继续，排序顺序也不会改变。
 
 
 
 
 
 改进建议
 
 
 在每次迭代中用于对2 ⁱ -grams进行排序的算法似乎是内置的 sort （或 std :: sort ）。这意味着它是一种比较排序，在最坏的情况下，在每次迭代中都采用O（n log n）时间。由于在最坏情况下存在log n迭代，所以这使得它成为O（n（log n）²） - 时间算法。然而，排序可以通过使用两次桶排序来执行，因为我们用于排序比较的键（即前一步骤的词典名称）形成增加的整数序列。所以这可以改进为后缀排序的实际O（n log n）时间算法。
 
 
 
 
 
 备注< h1> 
 
 我相信这是由Manber和Myers在1992年的论文中提出的后缀数组构造的原始算法（ Google Scholar上的链接;它应该是第一个点击，它可能有一个PDF的链接）。这个（同时，但是独立于Gonnet和Baeza-Yates的一篇论文）是引入后缀数组（当时也称为pat数组），作为一个有意进一步研究的数据结构。
 
 
 后缀数组构造的现代算法为O（n），所以以上不再是可用的最佳算法（至少不是理论上的最坏情况）。
 
 
 
 
 
 
脚注
 
 
  ^（*）由 2克我的意思是原始字符串的两个连续字符的序列。例如，当 S = abcde 是字符串时，则 ab ， bc ， cd ， de 是2克 S 。同样地，$ code> abcd 和 bcde 是4克。通常，m-gram（对于正整数m）是 m 连续字符的序列。 1克也称为单体，2克称为二聚体，3克称为三元组。有些人继续使用四角形，五角星等等。
 
 
 请注意， S 的后缀起始位置 i ，是（$）$（$）$ S $ / code。另外，每个m-gram（对于任何m）是 S 之一的后缀的前缀。因此，排序m-gram（对于m尽可能大）可以是排序后缀的第一步。
 
After quite a bit of reading, I have figured out what a suffix array and LCP array represents.

Suffix array: Represents the _lexicographic rank of each suffix of an array.

LCP array : Contains the maximum length prefix match between two consecutive suffixes, after they are sorted lexicographically.

I have been trying hard to understand since a couple of days , how exactly the suffix array and LCP algorithm works.

Here is the code , which is taken from Codeforces:
/*
Suffix array O(n lg^2 n)
LCP table O(n)
*/
#include <cstdio>
#include <algorithm>
#include <cstring>

using namespace std;

#define REP(i, n) for (int i = 0; i < (int)(n); ++i)

namespace SuffixArray
{
    const int MAXN = 1 << 21;
    char * S;
    int N, gap;
    int sa[MAXN], pos[MAXN], tmp[MAXN], lcp[MAXN];

    bool sufCmp(int i, int j)
    {
        if (pos[i] != pos[j])
            return pos[i] < pos[j];
        i += gap;
        j += gap;
        return (i < N && j < N) ? pos[i] < pos[j] : i > j;
    }

    void buildSA()
    {
        N = strlen(S);
        REP(i, N) sa[i] = i, pos[i] = S[i];
        for (gap = 1;; gap *= 2)
        {
            sort(sa, sa + N, sufCmp);
            REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);
            REP(i, N) pos[sa[i]] = tmp[i];
            if (tmp[N - 1] == N - 1) break;
        }
    }

    void buildLCP()
    {
        for (int i = 0, k = 0; i < N; ++i) if (pos[i] != N - 1)
        {
            for (int j = sa[pos[i] + 1]; S[i + k] == S[j + k];)
            ++k;
            lcp[pos[i]] = k;
            if (k)--k;
        }
    }
} // end namespace SuffixArray
I cannot, just cannot get through how this algorithm works. I tried working on an example using pencil and paper, and wrote through the steps involved, but lost link in between as its too complicated, for me at least.

Any help regarding explanation, using an example maybe, is highly appreciated. 
 解决方案 
Overview

This is an O(n log n) algorithm for suffix array construction (or rather, it would be, if instead of ::sort a 2-pass bucket sort had been used).

It works by first sorting the 2-grams^(*), then the 4-grams, then the 8-grams, and so forth, of the original string S, so in the i-th iteration, we sort the 2ⁱ-grams. There can obviously be no more than log₂(n) such iterations, and the trick is that sorting the 2ⁱ-grams in the i-th step is facilitated by making sure that each comparison of two 2ⁱ-grams is done in O(1) time (rather than O(2ⁱ) time).

How does it do this? Well, in the first iteration it sorts the 2-grams (aka bigrams), and then performs what is called lexicographic renaming. This means it creates a new array (of length n) that stores, for each bigram, its rank in the bigram sorting.

Example for lexicographic renaming: Say we have a sorted list of some bigrams {'ab','ab','ca','cd','cd','ea'}. We then assign ranks (i.e. lexicographic names) by going from left to right, starting with rank 0 and incrementing the rank whenever we encounter a new bigram changes. So the ranks we assign are as follows:
ab : 0
ab : 0   [no change to previous]
ca : 1   [increment because different from previous]
cd : 2   [increment because different from previous]
cd : 2   [no change to previous]
ea : 3   [increment because different from previous]
These ranks are known as lexicographic names.

Now, in the next iteration, we sort 4-grams. This involves a lot of comparisons between different 4-grams. How do we compare two 4-grams? Well, we could compare them character by character. That would be up to 4 operations per comparison. But instead, we compare them by looking up the ranks of the two bigrams contained in them, using the rank table generated in the previous steps. That rank represents the lexicographic rank from the previous 2-gram sort, so if for any given 4-gram, its first 2-gram has a higher rank than the first 2-gram of another 4-gram, then it must be lexicographically greater somewhere in the first two characters. Hence, if for two 4-grams the rank of the first 2-gram is identical, they must be identical in the first two characters. In other words, two look-ups in the rank table are sufficient to compare all 4 characters of the two 4-grams.

After sorting, we create new lexicographic names again, this time for the 4-grams.

In the third iteration, we need to sort by 8-grams. Again, two look-ups in the lexicographic rank table from the previous step are sufficient to compare all 8 characters of two given 8-grams.

And so forth. Each iteration i has two steps:

Sorting by 2ⁱ-grams, using the lexicographic names from the previous iteration to enable comparisons in 2 steps (i.e. O(1) time) each
Creating new lexicographic names
We repeat this until all 2ⁱ-grams are different. If that happens, we are done. How do we know if all are different? Well, the lexicographic names are an increasing sequence of integers, starting with 0. So if the highest lexicographic name generated in an iteration is the same as n-1, then each 2ⁱ-gram must have been given its own, distinct lexicographic name.



Implementation

Now let's look at the code to confirm all of this. The variables used are as follows: sa[] is the suffix array we are building. pos[] is the rank lookup-table (i.e. it contains the lexicographic names), specifically, pos[k] contains the lexicographic name of the k-th m-gram of the previous step. tmp[] is an auxiliary array used to help create pos[].

I'll give further explanations between the code lines:
void buildSA()
{
    N = strlen(S);

    /* This is a loop that initializes sa[] and pos[].
       For sa[] we assume the order the suffixes have
       in the given string. For pos[] we set the lexicographic
       rank of each 1-gram using the characters themselves.
       That makes sense, right? */
    REP(i, N) sa[i] = i, pos[i] = S[i];

    /* Gap is the length of the m-gram in each step, divided by 2.
       We start with 2-grams, so gap is 1 initially. It then increases
       to 2, 4, 8 and so on. */
    for (gap = 1;; gap *= 2)
    {
        /* We sort by (gap*2)-grams: */
        sort(sa, sa + N, sufCmp);

        /* We compute the lexicographic rank of each m-gram
           that we have sorted above. Notice how the rank is computed
           by comparing each n-gram at position i with its
           neighbor at i+1. If they are identical, the comparison
           yields 0, so the rank does not increase. Otherwise the
           comparison yields 1, so the rank increases by 1. */
        REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);

        /* tmp contains the rank by position. Now we map this
           into pos, so that in the next step we can look it
           up per m-gram, rather than by position. */
        REP(i, N) pos[sa[i]] = tmp[i];

        /* If the largest lexicographic name generated is
           n-1, we are finished, because this means all
           m-grams must have been different. */
        if (tmp[N - 1] == N - 1) break;
    }
}
About the comparison function

The function sufCmp is used to compare two (2*gap)-grams lexicographically. So in the first iteration it compares bigrams, in the second iteration 4-grams, then 8-grams and so on. This is controlled by gap, which is a global variable.

A naive implementation of sufCmp would be this:
bool sufCmp(int i, int j)
{
  int pos_i = sa[i];
  int pos_j = sa[j];

  int end_i = pos_i + 2*gap;
  int end_j = pos_j + 2*gap;
  if (end_i > N)
    end_i = N;
  if (end_j > N)
    end_j = N;

  while (i < end_i && j < end_j)
  {
    if (S[pos_i] != S[pos_j])
      return S[pos_i] < S[pos_j];
    pos_i += 1;
    pos_j += 1;
  }
  return (pos_i < N && pos_j < N) ? S[pos_i] < S[pos_j] : pos_i > pos_j;
}
This would compare the (2*gap)-gram at the beginning of the i-th suffix pos_i:=sa[i] with the one found at the beginning of the j-th suffix pos_j:=sa[j]. And it would compare them character by character, i.e. comparing S[pos_i] with S[pos_j], then S[pos_i+1] with S[pos_j+1] and so on. It continues as long as the characters are identical. Once they differ, it returns 1 if the character in the i-th suffix is smaller than the one in the j-th suffix, 0 otherwise. (Note that return a<b in a function returning int means you return 1 if the condition is true, and 0 if it is false.)

The complicated looking condition in the return-statement deals with the case that one of the (2*gap)-grams is located at the end of the string. In this case either pos_i or pos_j will reach N before all (2*gap) characters have been compared, even if all characters up to that point are identical. It will then return 1 if the i-th suffix is at the end, and 0 if the j-th suffix is at the end. This is correct because if all characters are identical, the shorter one is lexicographically smaller. If pos_i has reached the end, the i-th suffix must be shorter than the j-th suffix.

Clearly, this naive implementation is O(gap), i.e. its complexity is linear in the length of the (2*gap)-grams. The function used in your code, however, uses the lexicographic names to bring this down to O(1) (specifically, down to a maximum of two comparisons):
bool sufCmp(int i, int j)
{
  if (pos[i] != pos[j])
    return pos[i] < pos[j];
  i += gap;
  j += gap;
  return (i < N && j < N) ? pos[i] < pos[j] : i > j;
}
As you can see, instead of looking up individual characters S[i] and S[j], we check the lexicographic rank of the i-th and j-th suffix. Lexicographic ranks were computed in the previous iteration for gap-grams. So, if pos[i] < pos[j], then the i-th suffix sa[i] must start with a gap-gram that is lexicographically smaller than the gap-gram at the beginning of sa[j]. In other words, simply by looking up pos[i] and pos[j] and comparing them, we have compared the first gap characters of the two suffixes.

If the ranks are identical, we continue by comparing pos[i+gap] with pos[j+gap]. This is the same as comparing the next gap characters of the (2*gap)-grams, i.e. the second half. If the ranks are indentical again, the two (2*gap)-grams are indentical, so we return 0. Otherwise we return 1 if the i-th suffix is smaller than the j-th suffix, 0 otherwise.



Example

The following example illustrates how the algorithm operates, and demonstrates in particular the role of the lexicographic names in the sorting algorithm.

The string we want to sort is abcxabcd. It takes three iterations to generate the suffix array for this. In each iteration, I'll show S (the string), sa (the current state of the suffix array) and tmp and pos, which represent the lexicographic names.

First, we initialize:
S   abcxabcd
sa  01234567
pos abcxabcd
Note how the lexicographic names, which initially represent the lexicographic rank of unigrams, are simply identical to the characters (i.e. the unigrams) themselves.

First iteration:

Sorting sa, using bigrams as sorting criterion:
sa  04156273
The first two suffixes are 0 and 4 because those are the positions of bigram 'ab'. Then 1 and 5 (positions of bigram 'bc'), then 6 (bigram 'cd'), then 2 (bigram 'cx'). then 7 (incomplete bigram 'd'), then 3 (bigram 'xa'). Clearly, the positions correspond to the order, based solely on character bigrams.

Generating the lexicographic names:
tmp 00112345
As described, lexicographic names are assigned as increasing integers. The first two suffixes (both starting with bigram 'ab') get 0, the next two (both starting with bigram 'bc') get 1, then 2, 3, 4, 5 (each a different bigram).

Finally, we map this according to the positions in sa, to get pos:
sa  04156273
tmp 00112345
pos 01350124
(The way pos is generated is this: Go through sa from left to right, and use the entry to define the index in pos. Use the corresponding entry in tmp to define the value for that index. So pos[0]:=0, pos[4]:=0, pos[1]:=1, pos[5]:=1, pos[6]:=2, and so on. The index comes from sa, the value from tmp.)

Second iteration:

We sort sa again, and again we look at bigrams from pos (which each represents a sequence of two bigrams of the original string).
sa  04516273
Notice how the position of 1 5 have switched compared to the previous version of sa. It used to be 15, now it is 51. This is because the bigram at pos[1] and the bigram at pos[5] used to be identical (both bc) in during the previous iteration, but now the bigram at pos[5] is 12, while the bigram at pos[1] is 13. So position 5 comes before position 1. This is due to the fact that the lexicographic names now each represent bigrams of the original string: pos[5] represents bc and pos[6] represents 'cd'. So, together they represent bcd, while pos[1] represents bc and pos[2] represents cx, so together they represent bcx, which is indeed lexicographically greater than bcd.

Again, we generate lexicographic names by screening the current version of sa from left to right and comparing the corrsponding bigrams in pos:
tmp 00123456
The first two entries are still identical (both 0), because the corresponding bigrams in pos are both 01. The rest is an strictly increasing sequence of integers, because all other  bigrams in pos are each unique.

We perform the mapping to the new pos as before (taking indices from sa and values from tmp):
sa  04516273
tmp 00123456
pos 02460135
Third iteration:

We sort sa again, taking bigrams of pos (as always), which now each represents a sequence of 4 bigrams of the orginal string.
sa  40516273
You'll notice that now the first two entries have switched positions: 04 has become 40. This is because the bigram at pos[0] is 02 while the one at pos[4] is 01, the latter obviously being lexicographically smaller. The deep reason is that these two represent abcx and abcd, respectively.

Generating lexicographic names yields:
tmp 01234567
They are all different, i.e. the highest one is 7, which is n-1. So, we are done, because are sorting is now based on m-grams that are all different. Even if we continued, the sorting order would not change.



Improvement suggestion

The algorithm used to sort the 2ⁱ-grams in each iteration appears to be the built-in sort (or std::sort). This means it's a comparison sort, which takes O(n log n) time in the worst case, in each iteration. Since there are log n iterations in the worst case, this makes it a O(n (log n)²)-time algorithm. However, the sorting could by performed using two passes of bucket sort, since the keys we use for the sort comparison (i.e. the lexicographic names of the previous step), form an increasing integer sequence. So this could be improved to an actual O(n log n)-time algorithm for suffix sorting.



Remark

I believe this is the original algorithm for suffix array construction that was suggested in the 1992-paper by Manber and Myers (link on Google Scholar; it should be the first hit, and it may have a link to a PDF there). This (at the same time, but independently of a paper by Gonnet and Baeza-Yates) was what introduced suffix arrays (also known as pat arrays at the time) as a data structure interesting for further study.

Modern algorithms for suffix array construction are O(n), so the above is no longer the best algorithm available (at least not in terms of theoretical, worst-case complexity).



Footnotes

^(*) By 2-gram I mean a sequence of two consecutive characters of the original string. For example, when S=abcde is the string, then ab, bc, cd, de are the 2-grams of S. Similarly, abcd and bcde are the 4-grams. Generally, an m-gram (for a positive integer m) is a sequence of m consecutive characters. 1-grams are also called unigrams, 2-grams are called bigrams, 3-grams are called trigrams. Some people continue with tetragrams, pentagrams and so on.

Note that the suffix of S that starts and position i, is an (n-i)-gram of S. Also, every m-gram (for any m) is a prefix of one of the suffixes of S. Therefore, sorting m-grams (for an m as large as possible) can be the first step towards sorting suffixes.

                        这篇关于后缀阵列算法的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

后缀阵列算法 [英] Suffix Array Algorithm

问题描述

概述

实施

示例

改进建议

脚注

Overview

Implementation

Example

Improvement suggestion

Remark

Footnotes

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

后缀阵列算法 [英] Suffix Array Algorithm

问题描述

概述

实施

示例

改进建议

脚注

Overview

Implementation

Example

Improvement suggestion

Remark

Footnotes

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭