这是如何code从一个后缀数组获得LCP工作? [英] How does this code for obtaining LCP from a Suffix Array work?

查看:152
本文介绍了这是如何code从一个后缀数组获得LCP工作?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有人能解释这种code从后缀阵列工程建设LCP怎么样? suffixArr [] 是一个数组,这样 suffixArr [I] 持有的指数值的字符串为后缀秩的的。

 无效LCPconstruct()
{
    INT I,C [1001]升;
    C [suffixArr [0] = N;


    为(ⅰ= 1; I&n种;我+ +)
    C [suffixArr [I] = suffixArr [I-1];

    升= 0;

   对于(I = 0; I&n种;我+ +)
   {
    如果(C [I] == N)
        LCPadj [I] = 0;
    其他
    {
        而第(i + L&n种安培;和C [I] + 1&n种安培;&安培; S [I + 1] == S [C [i]于+ 1])
            升++;
        LCPadj [I] = 1;

        升=最大值(1-1,0);
    }
  }

  对于(I = 0; I&n种;我+ +)
     的cout&其中;&其中; LCPadj [suffixArr〔Ⅰ〕25;&所述;\ N的;


}
 

解决方案

首先,要认识到,算法处理后缀原来的顺序,也就是它们出现在输入字符串的顺序很重要。不要在字典顺序。

因此​​,如果输入字符串为 abxabc ,它首先考虑 abxabc ,然后 bxabc ,然后 xabc 等等。

有关每个后缀其认为在这个命令,它确定后缀,即它的词典编纂predecessor的位置(*)功能(因此在这里,只有在这里,它采用的概念字典序)。对于第一个后缀 abxabc ,辞书predecessor,即直接出现之前它后缀的词典式排序的后缀,是农行。它在阵列中确定该由一个O(1)查找 C ,已专门为此ppared $ P $。

内环 abxabc 的人物比较和农行一个接一个,并确定这两个后缀有第2个字符的共同点。这是在code中的变量,这意味着为后缀的LCP进入 abxabc 必须是2,所以我们设置 LCPadj [I] = L 。注意, I 这里指的是后缀的输入字符串中的位置,而不是它的后缀数组中的位置。因此, LCPadj 不是LCP阵列(还)。这是一个辅助的数据结构。

然后前进到下一个字符串,它是 bxabc 。此外它采用 C 找到 BC 是该辞书predecessor,然后确定$多少p $ PFIX字符两股。这里谈到的伎俩:它可以肯定,这一定是的至少的多达在previous步骤(即2),再减去1。为什么?因为当前我们认为, bxabc ,字符串是当然的字符串previously考虑的后缀(abxabc <$ C C $> )中,该字符串(因此辞书predecessor ​​ABC )也必须有一个后缀,即1个字符短( BC ),并且后缀也必须某处后缀数组中,它必须与当前考虑的串分享其preFIX,减去的第一个字符。此外,不能有任何其他后缀,既短和字典顺序更接近当前考虑的串。后者是,如果你想进行排序如何辞书的作品很合乎逻辑的,但也有这个(如引理5.10的Kärkkäinen讲座形式证明的这里

这样描述工作中的主要原理在这里。有几件事情要注意你的code充分理解每个变量的作用:

  • 如前所述, C 是一个辅助阵列( N 整数长度)的商店,在每个后缀输入字符串,即其他后缀,即它的直接辞书predecessor的位置。此数组构造不会从经过后缀列由左到右从左向右,但(明智地),因为这可以很容易地确定任何字符串的直接lexicogaphical predecessor:立即辞书predecessor从位置后缀的 suffixArr [I] 当然必须位于位置 suffixArr [I-1] 。确认您的code,这是多么 C 定义。
  • 如上所述,为后缀的顺序它们出现在输入字符串 LCPadj 商店LCP值,而不是为了在它们出现的后缀数组中。也就是说,而是由经历后缀列由左到右,和印刷 LCPadj为什么在输出时间 LCPadj 不是从左至右打印[我] 的顺序。证实,这是这种情况。

我希望这有助于。让我知道,如果没有。


(*)通过的辞书predecessor ​​的我的意思的直接的后缀中的字典顺序排序列表的predecessor后缀,即后缀为它的直接左边的后缀数组中开始。

Can someone explain how this code for constructing the LCP from a suffix array works? suffixArr[] is an array such that suffixArr[i] holds the value of the index in the string for the suffix with rank i.

 void LCPconstruct()
{
    int i,C[1001],l;
    C[suffixArr[0]] = n;


    for(i=1;i<n;i++)
    C[suffixArr[i]] = suffixArr[i-1];

    l = 0;

   for(i=0;i<n;i++)
   {
    if(C[i]==n)
        LCPadj[i] = 0;
    else
    {
        while(i+l<n && C[i]+l<n && s[i+l] == s[C[i]+l])
            l++;
        LCPadj[i] = l;

        l = max(l-1,0);
    }
  }

  for(i=0;i<n;i++)
     cout<<LCPadj[suffixArr[i]]<<"\n";


}

解决方案

First, it's important to realize that the algorithm processes the suffixes in the original order, i.e. the order in which they appear in the input string. Not in lexicographical order.

So if the input string is abxabc, it first considers abxabc, then bxabc, then xabc and so forth.

For each suffix it considers in this order, it determines the position of the suffix that is its lexicographical predecessor(*) (so here, and only here, it uses the concept of lexicographical order). For the first suffix abxabc, the lexicographical predecessor, i.e. the suffix that appears directly before it in the lexicographical ordering of suffixes, is abc. It determines this by an O(1) lookup in the array C, which has been prepared specifically for this purpose.

The inner loop compares the characters of abxabc and abc one by one and determines that these two suffixes have the first 2 characters in common. This is the variable l in your code, and it means that the entry in LCP for the suffix abxabc must be 2, so we set LCPadj[i] = l. Note that i here refers to the position of the suffix in the input string, not to its position in the suffix array. So LCPadj is not the LCP array (yet). It's an auxiliary data structure.

Then it proceeds to the next string, which is bxabc. Again it uses C to find that bc is the lexicographical predecessor of that and then determines how many prefix characters the two share. And here comes the trick: It can be sure that this must be at least as many as in the previous step (i.e. 2), minus 1. Why? Because the string we currently consider, bxabc, is of course a suffix of the string previously considered (abxabc), therefore the lexicographical predecessor of that string (abc) must also have a suffix that is 1 character shorter (bc), and that suffix must also be somewhere in the suffix array, and it must share its prefix with the currently considered string, minus the first character. Moreover, there can't be any other suffix that is both shorter and lexicographically closer to the string currently considered. The latter is quite logical if you think about how lexicographical sorting works, but there are also formal proofs of this (e.g. Lemma 5.10 in Kärkkäinen's lecture here)

So that describes the main principle at work here. There are a few things to note about your code to fully understand the role of each variable:

  • As explained, C is an auxiliary array (n integers in length) that stores, for each suffix in the input string, the position of that other suffix that is its immediate lexicographical predecessor. This array is constructed not from left to right, but (wisely) by going through the suffix array from left to right, because that makes it easy to determine the immediate lexicogaphical predecessor of any string: The immediate lexicographical predecessor of the suffix starting at position suffixArr[i] must of course be located at position suffixArr[i-1]. Confirm in your code that this is how C is defined.
  • As mentioned above, LCPadj stores LCP values for the suffix in the order in which they appear in the input string, not the order in which they appear in the suffix array. That is why at output time, LCPadj is not printed from left to right, but by going through the suffix array from left to right, and printing LCPadj[i] in that order. Confirm that this is the case.

I hope this helps. Let me know if not.


(*)By lexicographical predecessor I mean the immediate predecessor of the suffix in the lexicographically ordered list of suffixes, i.e. the suffix to its immediate left in the suffix array.

这篇关于这是如何code从一个后缀数组获得LCP工作?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆