为什么我们在后缀树中需要一个定点字符? [英] Why do we need a sentinel character in a Suffix Tree?

查看:97
本文介绍了为什么我们在后缀树中需要一个定点字符?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当我们实现(如果允许使用非分支内部节点,则证明这一点)。



这也意味着在实践中,您可能会使用不同的方式来处理作为其他后缀的前缀的后缀以及非分支内部节点。例如,如果您使用众所周知的Ukkonen算法来构造树,则可以在不将唯一字符附加到末尾的情况下进行操作。您只需要确保在最后一次迭代之后,将非分支内部节点放在每个隐式后缀(即,每个后缀在一条边的中间)的末尾即可。



再次,可能还有进一步的非常具体的原因,在构造之前将 $ 放在文本末尾后缀树或数组。例如,在基于DC(差异覆盖)原理的后缀数组构造算法中,必须在字符串的末尾放置两个 $ 符号,以确保字符串的最后一个字符是完整字符三字组的一部分,因为该算法基于三字组排序。此外,在某些情况下,必须以特殊方式解释唯一的 $ 字符。对于Ukkonen构造算法, $ 是唯一的;对于DC后缀数组算法,除唯一性外, $ 在字典上比所有其他字符都要小,并且在基于后缀树的圆字符串切割算法中(最近在此处中提到),实际上有必要解释 $ 作为字典上最大的字符。


Why do we need to append "$" to the original string when we implement a suffix tree?

解决方案

There can be special reasons for appending one (or even more) special characters to the end of the string when specific construction algorithms are used – both in the case of suffix trees and suffix arrays.

However, the most fundamental underlying reason in the case of suffix trees is a combination of two properties of suffix trees:

  1. Suffix trees are PATRICIA trees, i.e. the edge labels are, unlike the edge labels of tries, strings consisting of one or more characters
  2. Internal nodes exist only at branching points

This means you can potentially have a situation where one edge label is a prefix of another:

The idea here is that the black node on the right is a leaf node, i.e. a suffix ends here. But if the text has a suffix aa, then the single character a must also be a suffix. But there is no way for us to store the information that a suffix ends after the first a, because aa forms one continuous edge of the tree (property 1 above). We would have to introduce an intermediate node in which we could store the information, like this:

But this would be illegal because of property 2: No inner node must exist unless there is a branching point.

The problem is solved if we can guarantee that the last character of the text is a character that occurs nowhere else in the entire string. The dollar sign is normally used as a symbol for that.

Clearly, if the last character occurs nowhere else, there can't possible be any repetition (such as aa, or even a more complex one like abcabc) at the end of the string, hence the problem of non-branching inner nodes does not occur. In the example above, the effect of putting $ at the end of the string is:

There are three suffixes now: aa$, a$ and $, but none is a prefix of another. Obviously, this means we need to introduce an inner node after all, and there are a total of three leaves now. So, to be sure, the advantage of this is not that we save space or anything becomes more efficient. It's just a way to guarantee the two properties above. These properties are important when we prove certain useful characteristics of suffix trees, including the fact that its number of inner nodes is linear in the length of the string (you could not prove this if non-branching inner nodes were allowed).

This also means that in practice, you might use different ways of dealing with suffixes that are prefixes of other suffixes, and with non-branching inner nodes. For example, if you use the well-known Ukkonen algorithm to construct the tree, you can do that without appending a unique character to the end; you just have to make sure that at the end, after the final iteration, you put non-branching inner nodes to the end of every implicit suffix (i.e. every suffix that ends in the middle of an edge).

Again, there can be further, and very specific reasons for putting $ at the end of text before constructing a suffix tree or array. For example, in construction algorithms for suffix arrays that are based on the DC (difference cover) principle, you must put two $ signs to the end of the string to ensure that even the last character of the string is part of a complete character trigram, as the algorithm is based on trigram sorting. Furthermore, there are specific situations when the unique $ character must be interpreted in a special way. For the Ukkonen construction algorithm, it is sufficient for $ to be unique; for the DC suffix array algorithms it is necessary, in addition to uniqueness, that $ is lexicographically smaller than all other characters, and in the suffix-tree based circular string cutting algorithm (mentioned recently here) it is actually necessary to interpret $ as the lexicographically largest character.

这篇关于为什么我们在后缀树中需要一个定点字符?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆