Ukkonen后缀树:过程'册封'不明确 [英] Ukkonen suffix tree: procedure 'canonize' unclear

查看:131
本文介绍了Ukkonen后缀树:过程'册封'不明确的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

怎样的'推崇'功能(以下给出,从Ukkonen的纸)的工作,并且特别地,当while循环完成了吗?我认为,p值 - K'将永远是小于的P - K。我是对还是错?

 程序推崇(S,(K,P)):
1.如果P< 10k,则返回(S,K)
2.其他
3.找到tk的过渡克'(S,(K',P'))= s'的从s;
4.而P' -  K'< = P  -  K做
5. K = K + P' -  K'+ 1;
6. S = S';
7.如果k&其中; = P然后找到的TK-过渡克'(S,(K',P'))= s'的从s;
8.回报(S,K)。
 

解决方案

什么推崇函数的作用是在<一个尽头描述href="http://stackoverflow.com/questions/9452701/can-someone-please-explain-ukkonens-suffix-tree-algorithm-in-plain-english/9513423#9513423">this SA帖子,我们认为这样的情况:

的情况是这样的:

  1. 活动点是(红色,'D',3),即三个字符到 DEFG 边缘走出去红节点。

  2. 现在,我们遵循一个后缀链接到绿色节点。从理论上讲,我们的活动节点现在(绿色,D,3)

  3. 不幸的是这一点不存在,因为边缘走出去绿点已经得到了只有2个字符。 因此,我们运用推崇功能

它的工作原理是这样的:

  1. 我们感兴趣的边缘的起始字符是 D 。此字符称t <子> K 在Ukkonen的符号。因此,发现的T <子> K -edge是指找到边缘的绿色节点。

  2. 这种优势是仅有的两个字符长。即(P' - K')== 2 在Ukkonen的符号。但原边有三个特点:(P - K)== 3 。因此,&LT;。= 是真实的,我们进入循环

  3. 我们缩短我们正在寻找的边缘从 DEF F 。这是点什么:= P +(K' - P')+ 1 一步确实

  4. 我们进到状态边缘点,即蓝色状态。这就是 S:= S'确实

  5. 由于剩余部分 F 边缘不为空( K&LT; = P ) ,我们确定了相关的出边(也就是 FG 边走出去的蓝色节点)。此步骤设置数k和p'完全新的值,因为它们现在参阅字符串 FG ,系统和它的长度(对' - K')现在将是2。

  6. 剩余的边缘 F ,长度(P - K),现在是1,和候选边的长度 FG 新活动点,(P' - K'),是2。因此,循环条件

    而(P' - K')&LT; =(P - K)做

不再是真实的,所以循环结束,确实是新的(正确的)活动点是(蓝,'F',1)

[实际上,在Ukkonen的符号,一个边缘点的结束指针p的边缘的最后字符,而不是它后面的位置的位置。因此,严格来说,(对 - k)为0,而不是1,和(p' - K')是1,而不是2。但重要的是不长的绝对值,而是两个不同的相对的比较长度。]

最后几点注意事项:

  • 指针像P,K的参考位置在原来的输入文本吨。这可以是相当混乱。例如,在绿色节点用于边缘指针将指向的部分的子 T,以及用于 FG 边缘的蓝色节点参考的部分的子 FG指针 T的。虽然字符串 DEFG 必须作为一个连续字符串在T某处,则子 FG 可能出现在其他地方,太。因此, FG 边缘的指针k是不一定的结束指针p的边缘的加一

  • 什么时候我们决定是否结束循环的计数,因此是不绝对位置k或对,但长度(对 - k)的其余边缘的相对于长度(对' - 数k )当前候选边缘。

  • 在你的问题中,code段第4行,有一个错字:它应该是 K'而不是 K;

How does the 'canonize' function (given below, from Ukkonen's paper) work, and in particular, when is the while loop finished? I think the value of p' - k' will always remain less than that of p - k. Am I right or wrong?

procedure canonize(s, (k, p)):
1. if p < k then return (s, k)
2. else
3. find the tk–transition g'(s, (k', p')) = s' from s;
4. while p' − k' <= p − k do
5.     k  = k + p' − k' + 1;
6.     s  = s';
7.     if k <= p then find the tk–transition g'(s, (k', p')) = s' from s;
8. return (s, k).

解决方案

What the canonize function does is what is described at the very end of this SA post, where we consider a situation like this:

The situation is this:

  1. The active point is at (red,'d',3), i.e. three characters into the defg edge going out of the red node.

  2. Now we follow a suffix link to the green node. In theory, our active node is now (green,'d',3).

  3. Unfortunately that point does not exist, because the de edge going out of the green node has got only 2 characters. Hence, we apply the canonize function.

It works like this:

  1. The starting character of the edge we are interested in is d. This character is referred to as tk in Ukkonen's notation. So, "finding the tk-edge" means finding the de edge at the green node.

  2. That edge is only two characters in length. I.e. (p' - k') == 2 in Ukkonen's notation. But the original edge had three characters: (p - k) == 3. So <= is true and we enter the loop.

  3. We shorten the edge we are looking for from def to f. This is what the p := p + (k' - p') + 1 step does.

  4. We advance to the state the de edge points to, i.e. the blue state. That is what s := s' does.

  5. Since the remaining part f of the edge is not empty (k <= p), we identify the relevant outgoing edge (that is the fg edge going out of the blue node). This step sets k' and p' to entirely new values, because they now refer to the string fg, and its length (p' - k') will now be 2.

  6. The length of the remaining edge f, (p - k), is now 1, and the length of the candidate edge fg for the new active point, (p' - k'), is 2. Hence the loop condition

    while (p' - k') <= (p - k) do

is no longer true, so the loop ends, and indeed the new (and correct) active point is (blue,'f',1).

[Actually, in Ukkonen's notation, the end pointer p of an edge points to the position of the final character of the edge, not the position that follows it. Hence, strictly speaking, (p - k) is 0, not 1, and (p' - k') is 1, not 2. But what matters is not the absolute value of the length, but the relative comparison of the two different lengths.]

A few final notes:

  • Pointers like p and k refer to positions in the original input text t. That can be quite confusing. For example, the pointers used in the de edge at the green node will refer to some substring de of t, and the pointers used in the fg edge at the blue node will refer to some substring fg of t. Although the string defg must appear as one continuous string somewhere in t, the substring fg might appear in other places, too. So, the pointer k of the fg edge is not necessarily the end pointer p of the de edge plus one.

  • What counts when we decide whether or not to end the loop is therefore not the absolute positions k or p, but the length (p - k) of the remaining edge compared to the length (p' - k') of the current candidate edge.

  • In your question, line 4 of the code snippet, there is a typo: It should be k' instead of k;.

这篇关于Ukkonen后缀树:过程'册封'不明确的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆