如何从后缀数组获取LCP的代码如何工作? [英] How does this code for obtaining LCP from a Suffix Array work?

查看:149
本文介绍了如何从后缀数组获取LCP的代码如何工作?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有人可以解释如何从后缀数组构造LCP的代码如何工作? suffixArr [] 是一个数组,使得 suffixArr [i] 保存后缀字符串中索引的值与排名 i

  void LCPconstruct()
{
int i,C [1001],l;
C [suffixArr [0]] = n;


for(i = 1; i C [suffixArr [i]] = suffixArr [i-1]

l = 0;

for(i = 0; i {
if(C [i] == n)
LCPadj [i] = 0;
else
{
while(i + 1< n&& C [i] + 1< n& s [i + 1] == s [C [i ] + l])
l ++;
LCPadj [i] = l;

l = max(l-1,0);
}
}

for(i = 0; i cout ;


}


解决方案

<首先,重要的是要意识到算法以原始顺序处理后缀,即它们在输入字符串中出现的顺序。因此,如果输入字符串 abxabc ,它首先会考虑

abxabc ,然后 bxabc ,然后 xabc 等。



对于它在这个顺序中考虑的每个后缀,它确定后缀的位置,它是它的字典前缀(*)(所以这里,字典顺序的概念)。对于第一个后缀 abxabc ,词典前缀,即后缀在后缀的词典顺序中出现的后缀是 abc 。它通过在专门为此目的准备的数组 C 中的O(1)查找来确定。



内部循环逐个比较 abxabc abc 的字符,并确定这两个后缀具有第一个2个字符共同。这是您的代码中的变量 l ,这意味着LCP中后缀 abxabc 的条目必须为2 ,因此我们设置 LCPadj [i] = l 。注意, i 这里指的是后缀在输入字符串中的位置,而不是它在后缀数组中的位置。因此 LCPadj 不是LCP数组(尚未)。这是一个辅助数据结构。



然后它继续到下一个字符串,它是 bxabc 。再次,它使用 C 找到 bc 是它的词典前身,然后确定两个共享。这里来的诀窍:它可以肯定这必须至少多于上一步(即2),减1.为什么?因为我们当前考虑的字符串 bxabc 当然是以前考虑的字符串的后缀( abxabc ),因此该字符串的字典式前身( abc )也必须有一个小于1个字符的后缀( bc ),该后缀也必须在后缀数组中的某处,并且它必须与当前考虑的字符串共享其前缀,减去第一个字符。此外,不能有任何其他后缀,更短和词典更接近当前考虑的字符串。如果你想到字典排序如何工作,后者是相当合乎逻辑的,但也有正式的证据(例如Lemma 5.10在Kärkkäinen的演讲这里



这里描述了工作的主要原则。有关代码的一些注意事项,以便充分了解每个变量的作用:




  • 如前所述, C 是一个辅助数组(长度为 n 整数),用于为输入字符串中的每个后缀存储该另一个后缀的位置它的直接词典前身。这个数组不是从左到右构建,而是(明智地)从左到右穿过后缀数组,因为这使得很容易确定任何字符串的直接词典前导:从位置开始的后缀的立即词典前导 suffixArr [i] 当然必须位于 suffixArr [i-1] 。在您的代码中确认这是如何定义 C

  • 如上所述, LCPadj 按照它们在输入字符串中出现的顺序存储后缀的LCP值,而不是它们在后缀数组中出现的顺序。这就是为什么在输出时, LCPadj 不是从左到右打印,而是通过从左到右浏览后缀数组,并打印 LCPadj [i] 。确认是这种情况。



我希望这有助于。






(*) em>我的意思是后缀的字典顺序列表中的后缀的后缀,即后缀数组中其后面的后缀。


Can someone explain how this code for constructing the LCP from a suffix array works? suffixArr[] is an array such that suffixArr[i] holds the value of the index in the string for the suffix with rank i.

 void LCPconstruct()
{
    int i,C[1001],l;
    C[suffixArr[0]] = n;


    for(i=1;i<n;i++)
    C[suffixArr[i]] = suffixArr[i-1];

    l = 0;

   for(i=0;i<n;i++)
   {
    if(C[i]==n)
        LCPadj[i] = 0;
    else
    {
        while(i+l<n && C[i]+l<n && s[i+l] == s[C[i]+l])
            l++;
        LCPadj[i] = l;

        l = max(l-1,0);
    }
  }

  for(i=0;i<n;i++)
     cout<<LCPadj[suffixArr[i]]<<"\n";


}

解决方案

First, it's important to realize that the algorithm processes the suffixes in the original order, i.e. the order in which they appear in the input string. Not in lexicographical order.

So if the input string is abxabc, it first considers abxabc, then bxabc, then xabc and so forth.

For each suffix it considers in this order, it determines the position of the suffix that is its lexicographical predecessor(*) (so here, and only here, it uses the concept of lexicographical order). For the first suffix abxabc, the lexicographical predecessor, i.e. the suffix that appears directly before it in the lexicographical ordering of suffixes, is abc. It determines this by an O(1) lookup in the array C, which has been prepared specifically for this purpose.

The inner loop compares the characters of abxabc and abc one by one and determines that these two suffixes have the first 2 characters in common. This is the variable l in your code, and it means that the entry in LCP for the suffix abxabc must be 2, so we set LCPadj[i] = l. Note that i here refers to the position of the suffix in the input string, not to its position in the suffix array. So LCPadj is not the LCP array (yet). It's an auxiliary data structure.

Then it proceeds to the next string, which is bxabc. Again it uses C to find that bc is the lexicographical predecessor of that and then determines how many prefix characters the two share. And here comes the trick: It can be sure that this must be at least as many as in the previous step (i.e. 2), minus 1. Why? Because the string we currently consider, bxabc, is of course a suffix of the string previously considered (abxabc), therefore the lexicographical predecessor of that string (abc) must also have a suffix that is 1 character shorter (bc), and that suffix must also be somewhere in the suffix array, and it must share its prefix with the currently considered string, minus the first character. Moreover, there can't be any other suffix that is both shorter and lexicographically closer to the string currently considered. The latter is quite logical if you think about how lexicographical sorting works, but there are also formal proofs of this (e.g. Lemma 5.10 in Kärkkäinen's lecture here)

So that describes the main principle at work here. There are a few things to note about your code to fully understand the role of each variable:

  • As explained, C is an auxiliary array (n integers in length) that stores, for each suffix in the input string, the position of that other suffix that is its immediate lexicographical predecessor. This array is constructed not from left to right, but (wisely) by going through the suffix array from left to right, because that makes it easy to determine the immediate lexicogaphical predecessor of any string: The immediate lexicographical predecessor of the suffix starting at position suffixArr[i] must of course be located at position suffixArr[i-1]. Confirm in your code that this is how C is defined.
  • As mentioned above, LCPadj stores LCP values for the suffix in the order in which they appear in the input string, not the order in which they appear in the suffix array. That is why at output time, LCPadj is not printed from left to right, but by going through the suffix array from left to right, and printing LCPadj[i] in that order. Confirm that this is the case.

I hope this helps. Let me know if not.


(*)By lexicographical predecessor I mean the immediate predecessor of the suffix in the lexicographically ordered list of suffixes, i.e. the suffix to its immediate left in the suffix array.

这篇关于如何从后缀数组获取LCP的代码如何工作?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆