Ukkonen的通用后缀树算法 [英] Ukkonen's algorithm for Generalized Suffix Trees

查看:95
本文介绍了Ukkonen的通用后缀树算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在开发自己的后缀树实现(使用C ++,但问题仍然与语言无关)。我研究了 Ukkonen的原始论文。这篇文章很清楚,所以我开始进行实现,并尝试解决通用后缀树的问题。

I am currently working on my own Suffix Tree implementation (using C++, but the question remains language agnostic). I studied the original paper from Ukkonen. The article is very clear so I got to work on my implementation and tried to tackle the problem for Generalized Suffix Trees.

在树中,每个子字符串从节点到另一个用一对整数表示。尽管对于常规后缀树而言这很简单,但是当多个字符串共存于同一棵树中时(会成为通用后缀树),就会出现问题。确实,现在这样的一对还不够,我们需要另一个变量来说明我们正在使用哪个引用字符串。

In the tree, each substring leading from a node to another is represented using a pair of integer. While this is straightforward for a regular suffix tree, a problem arises when multiple strings coexist in the same tree (which becomes a Generalized Suffix Tree). Indeed, now such a pair is not sufficient, we need another variable to state which reference string we are using.

一个简单的例子。考虑字符串椰子

A quick example. Consider the string coconut:


  • 子字符串 nut (4,6)

  • 我们添加了麻烦制造者在树中,(4,6)现在可以是:


    • nut 如果引用第一个字符串。

    • ble 如果引用第一个字符串第二个字符串。

    • The substring nut would be (4,6).
    • We add troublemaker in the tree, (4,6) can now be:
      • nut if we refer to the first string.
      • ble if we refer to the second string.

      要处理此问题,我想添加一个代表字符串的id :

      To handle this, I thought adding an id representing the string:

      // A pair of int is a substring (regular tree)
      typedef std::pair<int,int> substring;
      // We need to bind a substring to its reference string:
      typedef std::pair<int, substring> mapped_substring;
      

      我目前面临的问题如下:

      The problem I currently face is the following:

      我得到一个在树中添加字符串的查询。在算法期间,我可能必须检查与其他已注册字符串相关的现有转换,这些转换以三元组(参考字符串ID k p )。某些更新操作基于子字符串索引,如何在这种情况下执行

      I get a query to add a string in the tree. During the algorithm, I may have to inspect existing transitions related to other registered strings, represented as a triplet (reference string id, k, p). Some updating operations are based on the substrings indexes, how can I perform them in such conditions?

      注意:问题是语言-不可知的,所以我没有加入c ++

      推荐答案



      TL ; DR



      EDIT(03/2019):我已经重新实现了使用C ++ 17 string_view 代表我的子字符串以及一种确保参考字符串不会四处移动的缓存机制。可以在github上找到更新的版本: https://github.com/Rerito/suffix-tree -v2 。这是我的旧实现(在C ++中)的github 。哦,新功能还包括测试!

      TL;DR

      EDIT (03/2019): I have reworked my implementation to use C++17 string_view to represent my substrings along with a caching mechanism that makes sure reference strings do not move around. The updated version can be found on github: https://github.com/Rerito/suffix-tree-v2. Here is the github for my old implementation (in C++) for the curious minds. Oh and the new take also includes tests!

      实际上不需要修改原始算法即可构建通用后缀树

      The original algorithm doesn't really need to be modified in order to build a Generalized Suffix Tree.

      我的直觉是正确的方法。为了跟上原始算法中使用的技巧,我们确实需要添加对原始字符串的引用。而且,该算法是 online ,这意味着您可以在树上即时添加字符串。

      The hunch I got was the right way. In order to keep up with the tricks used in the original algorithm, we indeed need to add a reference to the original string. Moreover, the algorithm is online, which means you can add strings on-the-fly to the tree.

      假设我们有一个广义后缀树 GST( N )用于字符串( S 1 ,...,S N )。这里的问题是如何使用GST( N )处理GST( N + 1 )的构建过程。

      Suppose we have a Generalized Suffix Tree GST(N) for strings (S1, ..., SN). The problem at hand here is how to handle the building process of GST(N+1), using GST(N).

      在简单情况下(单个后缀树),每个转换都是一对( substring 结束顶点)。 Ukkonen算法的诀窍是使用一对指向原始字符串中适当位置的指针对子字符串进行建模。在这里,我们还需要将这样的子字符串链接到其父 字符串。为此:

      In the simple case (single suffix tree), each transition is a couple (substring, end vertex). The trick in Ukkonen's algorithm is to model a substring using a pair of pointers to the appropriate positions in the original string. Here, we need to also link such a substring to its "parent" string. To do so:


      • 将原始字符串存储在哈希表中,并为它们提供唯一的整数键。

      • 子字符串现在变为:(ID,(左指针,右指针))。因此,我们可以使用ID来获取 O(1)中的原始字符串。

      • Store the original strings in a hash table, giving them a unique integer key.
      • A substring becomes now: (ID, (left pointer, right pointer)). We can thus use ID to fetch the original string in O(1).

      我们称其为a 映射的子字符串。我使用的C ++ typedef是在原始问题中找到的:

      We call this a mapped substring. The C++ typedefs I use are the ones found in my original question:

      // This is a very basic draft of the Node class used
      template <typename C>
      class Node {
          typedef std::pair<int, int> substring;
          typedef std::pair<int, substring> mapped_substring;
          typedef std::pair<mapped_substring, Node*> transition;
          // C is the character type (basically `char`)
          std::unordered_map<C, transition> g; // Called g just like in the article :)
          Node *suffix_link;
      };
      

      您将看到,我们还将保留引用对概念。这次,引用对与过渡一样,将包含一个映射的子字符串

      As you will see, we will keep the reference pair concept as well. This time, the reference pair, just like the transition, will hold a mapped substring.

      注意:与C ++中一样,字符串索引将从0开始。

      我们想将 S N + 1 插入GST( N )。

      GST( N )可能已经有很多节点和过渡。在简单的树中,我们只有根节点和特殊的接收器节点。在这里,我们可能有 S N + 1 的过渡,这些过渡已经通过插入一些先前的字符串而添加了。首先要做的是在过渡中沿着树走下去,只要它与 S N + 1 匹配即可。

      We want to insert SN+1 into GST(N).
      GST(N) may have already a whole lot of nodes and transitions. In a simple tree, we would only have the root and the special sink node. Here, we may have transitions for SN+1 that have already been added through the insertion of some previous strings. The first thing to do is to walk down the tree through the transitions as long as it matches SN+1.

      这样做,我们就以状态 r 结尾。此状态可能是显式的(即我们直接在顶点上结束)或隐式的(在过渡过程中发生不匹配)。我们使用与原始算法相同的概念对这种状态建模:参考对。快速示例:

      By doing so, we end in a state r. This state may be explicit (i.e. we ended right on a vertex) or implicit (a mismatch occurred in the middle of a transition). We use the same concept as in the original algorithm to model such a state: a reference pair. Quick example:


      • 我们要插入 S N + 1 = 香蕉

      • 代表 ba s >明显地存在于GST( N

      • s 中有一个过渡 t 子字符串 nal

      • We want to insert SN+1 = banana
      • A node s representing ba explicitly exists in GST(N)
      • s has a transition t for the substring nal

      当我们沿着树走时,最终字符 l 的过渡 t 。因此,我们得到的隐式状态 r 引用对 s m )表示,其中 m 是映射的子字符串(N + 1,(1,3))

      When we walk down the tree, we end up in the transition t at the character l which is a mismatch. Thus, the implicit state r we get is represented by the reference pair (s, m) where m is the mapped substring (N+1, (1,3)).

      此处, r 是构建香蕉后缀树的算法的第5次迭代的活动点。我们到达该状态的事实恰好意味着 bana 的树已经在GST( N )中构建。

      Here, r is the active point for the 5-th iteration of the algorithm in the building of banana's suffix tree. The fact we got to that state means precisely that the tree for bana is already built in GST(N).

      在此示例中,我们在第5次迭代时恢复算法,以使用<$ c的树为 banan 建立后缀树$ c> bana 。为了不失一般性,我们将声明 r = (s,(N + 1,(k,i-1)) i 是第一个不匹配项的索引。我们确实有 k≤i (egality是 r 的显式状态的同义词)。

      In this example, we resume the algorithm at the 5-th iteration, to build the suffix tree for banan using the tree for bana. Not to lose the generality, we will state that r = (s, (N+1, (k, i-1)), i being the index of the first mismatch. We have indeed k ≤ i (the egality is synonym of r being an explicit state).

      属性:我们可以在迭代 i (在索引插入字符)中恢复Ukkonen的算法来构建GST( N ) ( S N + 1 )中的i )。此迭代的活动点是我们在树上行走得到的状态 r 唯一需要进行的调整就是一些获取操作以解决子字符串。

      Property: We can resume Ukkonen's algorithm to build GST(N) at iteration i (insertion of the character at index i in SN+1). The active point for this iteration is the state r we got by walking down the tree. The only tweaking needed is some fetching operations to resolve the substrings.

      首先,状态 r 的存在意味着中间树 T( N + 1) i-1 也在那里,因此所有步骤都已设置好,我们恢复了算法。

      First, the presence of such a state r implies that the whole states for the intermediate tree T(N+1)i-1 are also there. So all is set up and we resume the algorithm.

      我们需要证明算法中的每个过程仍然有效,有3个这样的su例程:

      We need to prove that each procedure in the algorithm remains valid. There are 3 such subroutines:


      • test_and_split :给定要在当前迭代中插入的字符,测试是否我们需要将一个过渡分为两个单独的过渡,如果当前点是终点。

      • canonize :给定参考对 (n,m),其中 n 是一个顶点, m 是一个映射的子字符串,返回一对 (n',m')表示相同状态,例如 m'是最短的子字符串。

      • update :更新GST( N ),使其具有中间树 T(N + 1) i 在运行结束时。

      • test_and_split: given the character to insert at current iteration, test wether we need to split a transition into two separate transitions and if the current point is the end point.
      • canonize: Given a reference pair (n, m) where n is a vertex and m a mapped substring, returns the couple (n', m') representing the same state such as m' is the shortest possible substring.
      • update: Update GST(N) so that it has all the states for intermediate tree T(N+1)i at the end of the run.

      输入:一个顶点 s ,一个映射的子字符串 m =(l,(k,p))和一个字符 t

      输出:一个布尔值,它指示状态(s,m)是否是当前迭代的终点以及节点 r 重新如果不是终点,则显式显示(s,m)

      Input: A vertex s, a mapped substring m = (l, (k, p)) and a character t.
      Output: A boolean that tells if the state (s, m) is the end point for the current iteration and a node r representing explicitly (s, m) if it is not the end point.

      最简单的情况在前。如果子字符串为空( k> p ),则状态已明确表示。我们只需要测试是否达到终点即可。 在GST中,就像在公共后缀树中一样,每个节点从给定字符开始最多有一个 ALWAYS 过渡。因此,如果从 t ,我们返回true(到达终点),否则返回false。

      The simplest case goes first. If the substring is empty (k > p), the state is already represented explicitly. We just have to test if we reached the end point or not. In a GST, just like in a common suffix tree, there is ALWAYS at most one transition per node starting with a given character. Thus, if there is a transition starting with t, we return true (we reached the end point), otherwise false.

      现在,当k≤p时,是最困难的部分。我们首先需要获取原始字符串表中位于索引 l (*)的字符串 S l

      (l',(k',p'))(分别是 s')是相关的子字符串(分别是节点)到以字符 S l (k) (*)<开头的 s 的过渡 TR / strong>。之所以存在这种过渡,是因为(s,(l,(k,p))表示中间树 T的边界路径上的(现有)隐式状态。 (N + 1) i-1 。此外,我们确保此过渡中的 p-k 首字符匹配

      Now the hard part, when k ≤ p. We first need to fetch the string Sl lying at index l(*) in the original strings table.
      Let (l', (k', p')) (resp. s') be the substring (resp. the node) related to the transition TR of s starting with the character Sl(k) (*). Such a transition exists because (s, (l,(k,p)) represents an (existing) implicit state on the border path of the intermediate tree T(N+1)i-1. Furthermore, we are sure that the p - k first characters on this transition matches.

      我们需要拆分此过渡吗?这取决于此过渡上Δ= p-k + 1 个字符(**)。要测试此字符,我们需要获取哈希表上位于索引 l'处的字符串,并获取位于索引 k'+Δ处的字符。之所以保证存在此字符是因为我们正在检查的状态是隐式的,因此在过渡 TR (Δ≤p'-k')的中间结束。

      Do we need to split this transition? That depends on the Δ = p - k + 1-th character on this transition (**). To test this character, we need to fetch the string lying at index l' on the hash table and get the character at index k' + Δ. This character is guaranteed to exist because the state we are examining is implicit and thus ends in the middle of the transition TR (Δ ≤ p' - k').

      如果等式成立,我们什么都不做,返回true(终点在这里),什么也不做;否则,我们必须拆分过渡并创建一个新状态 r TR 现在变为(l',(k',k'+Δ-1))→ r 。为 r 创建了另一个过渡:(l',(k'+Δ,p')→s'。我们现在返回false和 r

      If the equality holds, we have nothing to do and return true (the end point is here) and do nothing else. If not, then we must split the transition and create a new state r. TR now becomes (l', (k', k' + Δ - 1)) → r. Another transition is created for r: (l', (k' + Δ, p') → s'. We now return false and r.

      (*) l 不一定等于 N + 1 。同样, l l'可能不同(或相等)。

      (*): l is not necessarily equal to N+1. Likewise, l and l' may be different (or equal).

      (* *):请注意,数字Δ= p-k + 1 完全不依赖于选择作为映射子字符串参考的字符串,而仅取决于

      (**): Notice that the number Δ = p - k + 1 does not depend at all on the string chosen as a reference for the mapped substring. It only depends on the implicit state fed to the routine.

      输入: >节点_s_和映射的子字符串(l,(k,p))表示树中的现有状态 e

      输出:节点 s'和映射的子字符串(l',(k',p'))表示状态 e的规范引用

      Input: A node _s_and a mapped substring (l,(k,p)) representing an existing state e in the tree.
      Output: A node s' and a mapped substring (l',(k',p')) representing the canonical reference for the state e

      使用相同的提取调整,我们只需要沿着树走下去,直到我们用尽了字符 pool。在这里就像 test_and_split 那样,每个转换都具有唯一性,而我们拥有输入的现有状态这一事实为我们提供了有效的程序。

      Using the same fetching tweaks, we just have to walk down the tree until we have exhausted the character "pool". Here, just like for test_and_split the unicity of each transition and the fact we have an existing state as input provides us with a valid procedure.

      输入:当前迭代的活动点和索引。

      输出:下一个迭代的活动点。

      Input: The active point and the index for the current iteration.
      Output: The active point for the next iteration.

      update 都使用 canonize test_and_split 是对GST友好的。后缀链接编辑与普通树的编辑完全相同。唯一需要注意的是,我们将使用 S N + 1 作为参考来创建开放式过渡(即通向节点的过渡)串。因此,在迭代 i 时,过渡将始终链接到映射的子字符串(N + 1,(i,∞))

      update uses both canonize and test_and_split which are GST-friendly. The suffix link editing is exactly the same as the one for a common tree. The only thing to notice is that we will create the open transitions (i.e. the transitions leading to nodes) using SN+1 as the reference string. Thus, at iteration i, the transition will always be linked to the mapped substring (N+1,(i,∞))

      我们需要关闭 开放式过渡 。为此,我们只需遍历它们并编辑∞,用 L-1 代替,其中 L S N +的长度1 。我们还需要一个标记字符来标记字符串的结尾。我们肯定不会在任何字符串中遇到的字符。这样,叶子将永远保留着叶子。

      We need to "close" the open transitions. To do so, we just traverse them and edit the ∞ away, replacing it with L-1, where L is the length of SN+1. We also need a sentinel character to mark the end of the string. A character we are sure to never meet in any string. This way, the leaves will remain leaves forever.

      额外的提取工作增加了一些 O(1)运算,稍微增加了复杂度的常数。但是,渐进复杂度显然与插入字符串的长度保持线性关系。因此,从字符串(S 1 ,...,S N 长度构建GST( N n 1 ,...,n N 是:

      The extra fetching work adds a few O(1) operations, increasing a little bit the constant factor of the complexity. Though, the asymptotic complexity remains obviously linear with the length of the inserted strings. Thus, building GST(N) from strings (S1,..., SN) with length n1,...,nN is:


      c(GST( N ))= Σ i = 1..N n i

      c(GST(N)) = Σi=1..N ni

      这篇关于Ukkonen的通用后缀树算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆