普通英语的Ukkonen后缀树算法 [英] Ukkonen's suffix tree algorithm in plain English

查看:63
本文介绍了普通英语的Ukkonen后缀树算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在这一点上,我感觉有点厚.我已经花了几天的时间试图完全围绕后缀树的构造,但是由于我没有数学背景,因此许多解释都使我难以理解,因为它们开始过度使用数学符号系统.我发现的与最佳解释最接近的是 带后缀的快速字符串搜索树 ,但他掩盖了各个方面,算法的某些方面仍不清楚.

I feel a bit thick at this point. I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I've found is Fast String Searching With Suffix Trees, but he glosses over various points and some aspects of the algorithm remain unclear.

我敢肯定,在我这里,对于Stackoverflow上的这种算法的逐步说明对我之外的许多其他人来说都是无价的.

A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I'm sure.

作为参考,这是Ukkonen关于算法的论文: http://www. cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

For reference, here's Ukkonen's paper on the algorithm: http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

到目前为止,我的基本了解:

My basic understanding, so far:

  • 我需要遍历给定字符串T的每个前缀P
  • 我需要遍历前缀P中的每个后缀S并将其添加到树中
  • 要将后缀S添加到树中,我需要遍历S中的每个字符,其中的迭代包括沿着以S中相同的字符集C开头的现有分支以及可能将一条边分割成后代当我在后缀中遇到一个不同的字符时出现节点,或者如果没有匹配的边要走下来.如果找不到匹配的边沿C向下走,则会为C创建新的叶边.

正如大多数解释所指出的那样,基本算法似乎是O(n 2 ),因为我们需要遍历所有前缀,然后才需要遍历每个前缀.后缀的每个前缀. Ukkonen的算法显然是独特的,因为他使用了后缀指针技术,尽管我认为 是我难以理解的.

The basic algorithm appears to be O(n2), as is pointed out in most explanations, as we need to step through all of the prefixes, then we need to step through each of the suffixes for each prefix. Ukkonen's algorithm is apparently unique because of the suffix pointer technique he uses, though I think that is what I'm having trouble understanding.

我也很难理解:

  • 准确地分配,使用和更改活动点"的时间和方式
  • 该算法的规范化方面发生了什么
  • 为什么我看到的实现需要修复"它们正在使用的边界变量

这是完整的 C#源代码.它不仅可以正常工作,而且支持自动规范化,并呈现输出的外观更好的文本图.源代码和示例输出位于:

Here is the completed C# source code. It not only works correctly, but supports automatic canonization and renders a nicer looking text graph of the output. Source code and sample output is at:

https://gist.github.com/2373868


更新2017-11-04

多年后,我发现后缀树有了新的用途,并且已经在 JavaScript 中实现了该算法.要点在下面.它应该是没有错误的.将其从同一位置转储到js文件npm install chalk中,然后与node.js一起运行以查看一些彩色输出.在同一Gist中有一个精简版,没有任何调试代码.

After many years I've found a new use for suffix trees, and have implemented the algorithm in JavaScript. Gist is below. It should be bug-free. Dump it into a js file, npm install chalk from the same location, and then run with node.js to see some colourful output. There's a stripped down version in the same Gist, without any of the debugging code.

https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6 strong>

https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6

推荐答案

以下是描述Ukkonen算法的尝试:首先显示字符串简单(即不包含任何重复字符)时的行为,然后再显示扩展到完整算法.

The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (i.e. does not contain any repeated characters), and then extending it to the full algorithm.

首先,一些初步声明.

  1. 我们正在构建的基本上是 ,就像一个搜索树.所以那里 是根节点,边缘掉出来会导致新节点,并且

  1. What we are building, is basically like a search trie. So there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth

但是:与搜索特里不同,边缘标签不是单一的 人物.而是使用一对整数来标记每个边 [from,to].这些是文本的指针.从这个意义上讲,每个 边缘带有任意长度的字符串标签,但仅占用O(1) 空格(两个指针).

But: Unlike in a search trie, the edge labels are not single characters. Instead, each edge is labeled using a pair of integers [from,to]. These are pointers into the text. In this sense, each edge carries a string label of arbitrary length, but takes only O(1) space (two pointers).

基本原理

我想首先展示如何创建一个后缀树 特别简单的字符串,没有重复字符的字符串:

Basic principle

I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:

abc

算法从左到右逐步进行操作.字符串的每个字符都有一步.每个步骤可能涉及一个以上的单独操作,但是我们将看到(请参阅最后的最终观察结果),操作总数为O(n).

The algorithm works in steps, from left to right. There is one step for every character of the string. Each step might involve more than one individual operation, but we will see (see the final observations at the end) that the total number of operations is O(n).

因此,我们从 left 开始,首先仅插入单个字符 a通过创建从根节点(在左侧)到叶的边, 并将其标记为[0,#],这意味着边缘代表 子字符串从位置0开始到当前结束结束.一世 使用符号#表示当前结束,它位于位置1 (在a之后).

So, we start from the left, and first insert only the single character a by creating an edge from the root node (on the left) to a leaf, and labeling it as [0,#], which means the edge represents the substring starting at position 0 and ending at the current end. I use the symbol # to mean the current end, which is at position 1 (right after a).

所以我们有一棵初始树,看起来像这样:

So we have an initial tree, which looks like this:

这是什么意思

现在,我们前进到位置2(在b之后). 我们在每个步骤中的目标 是要插入所有后缀到当前位置.我们这样做 由

Now we progress to position 2 (right after b). Our goal at each step is to insert all suffixes up to the current position. We do this by

  • 将现有的a -edge扩展到ab
  • b
  • 插入一个新边缘
  • expanding the existing a-edge to ab
  • inserting one new edge for b

在我们的表示中,这看起来像

In our representation this looks like

它的意思是:

我们观察两件事:

  • ab的边缘表示与以前一样相同 在初始树中:[0,#].它的含义已自动改变 因为我们将当前位置#从1更新为2.
  • 每个边缘消耗O(1)空间,因为它仅包含两个 指向文本的指针,无论它有多少个字符 代表.
  • The edge representation for ab is the same as it used to be in the initial tree: [0,#]. Its meaning has automatically changed because we updated the current position # from 1 to 2.
  • Each edge consumes O(1) space, because it consists of only two pointers into the text, regardless of how many characters it represents.

接下来,我们再次增加位置,并通过追加来更新树 c到每个现有边,并为新边插入一个新边 后缀c.

Next we increment the position again and update the tree by appending a c to every existing edge and inserting one new edge for the new suffix c.

在我们的表示中,这看起来像

In our representation this looks like

它的意思是:

我们遵守:

  • 树是直到当前位置的正确后缀树 每一步之后
  • 步骤与文字中的字符一样多
  • 每个步骤的工作量为O(1),因为所有现有边 通过增加#并插入 可以在O(1)中完成最后一个角色的新边缘 时间.因此,对于长度为n的字符串,只需要O(n)时间.
  • The tree is the correct suffix tree up to the current position after each step
  • There are as many steps as there are characters in the text
  • The amount of work in each step is O(1), because all existing edges are updated automatically by incrementing #, and inserting the one new edge for the final character can be done in O(1) time. Hence for a string of length n, only O(n) time is required.

当然这很好用,只是因为我们的字符串没有 包含任何重复.现在我们来看一个更现实的字符串:

Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:

abcabxabcd

如上例中一样,它以abc开头,然后重复abx,然后重复abcd.

It starts with abc as in the previous example, then ab is repeated and followed by x, and then abc is repeated followed by d.

第1步到第3步:在前3步之后,我们有了上一个示例中的树:

Steps 1 through 3: After the first 3 steps we have the tree from the previous example:

步骤4::将#移至位置4.这将隐式更新所有现有的 的边缘:

Step 4: We move # to position 4. This implicitly updates all existing edges to this:

,我们需要在以下位置插入当前步骤的最后后缀a: 根.

and we need to insert the final suffix of the current step, a, at the root.

在执行此操作之前,我们先引入另外两个变量(除了 #),这当然一直都在那儿,但我们没有使用过 到目前为止:

Before we do this, we introduce two more variables (in addition to #), which of course have been there all the time but we haven't used them so far:

  • 活动点,它是三重 (active_node,active_edge,active_length)
  • remainder,它是一个整数,指示有多少个新后缀 我们需要插入
  • The active point, which is a triple (active_node,active_edge,active_length)
  • The remainder, which is an integer indicating how many new suffixes we need to insert

这两个的确切含义很快就会清楚,但是现在 我们只是说:

The exact meaning of these two will become clear soon, but for now let's just say:

  • 在简单的abc示例中,活动点始终是 (root,'\0x',0),即active_node是根节点,active_edge被指定为空字符'\0x'active_length为零.这样做的结果是, 我们在每个步骤中插入的内容都作为 刚创建的边缘.我们很快就会看到为什么需要三元组来 代表此信息.
  • remainder始终在每个开始处设置为1 步.意思是我们必须加上后缀的数量 在每个步骤的结尾处积极插入的内容为1(总是 最后一个字符).
  • In the simple abc example, the active point was always (root,'\0x',0), i.e. active_node was the root node, active_edge was specified as the null character '\0x', and active_length was zero. The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge. We will see soon why a triple is necessary to represent this information.
  • The remainder was always set to 1 at the beginning of each step. The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).

现在这将改变.当我们插入当前的决赛 根字符a,我们注意到已经有一个传出 以a开头的边,具体是:abca.这是我们要做的 这样的情况:

Now this is going to change. When we insert the current final character a at the root, we notice that there is already an outgoing edge starting with a, specifically: abca. Here is what we do in such a case:

  • 我们在根节点上插入新的边缘[4,#].相反,我们 只需注意后缀a已经在我们的 树.它结束于较长边缘的中间,但我们并非如此 对此感到困扰.我们只是把事情保持原样.
  • 我们设置活动点(root,'a',1).这意味着积极 点现在位于根节点以a开头的传出边缘的中间,特别是在该边缘上的位置1之后.我们 请注意,边缘仅由其第一个指定 字符a.这样就可以了,因为只能有一个边缘 以任何特定字符开头(在通读了整个说明之后,请确保这是正确的).
  • 我们还增加了remainder,因此在下一步开始时 将会是2.
  • We do not insert a fresh edge [4,#] at the root node. Instead we simply notice that the suffix a is already in our tree. It ends in the middle of a longer edge, but we are not bothered by that. We just leave things the way they are.
  • We set the active point to (root,'a',1). That means the active point is now somewhere in the middle of outgoing edge of the root node that starts with a, specifically, after position 1 on that edge. We notice that the edge is specified simply by its first character a. That suffices because there can be only one edge starting with any particular character (confirm that this is true after reading through the entire description).
  • We also increment remainder, so at the beginning of the next step it will be 2.

观察:找到需要插入的最后一个后缀后, 已经存在于树中,树本身根本没有不变(我们仅更新活动点和remainder).那个树 则不能精确表示后缀树 当前位置,但它包含所有后缀(因为最后一个 后缀a隐式包含在 中).因此,除了更新 变量(都是固定长度的,所以这是O(1)), 没有工作,在此步骤中完成.

Observation: When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the active point and remainder). The tree is then not an accurate representation of the suffix tree up to the current position any more, but it contains all suffixes (because the final suffix a is contained implicitly). Hence, apart from updating the variables (which are all of fixed length, so this is O(1)), there was no work done in this step.

步骤5::我们将当前位置#更新为5.这 自动将树更新为此:

Step 5: We update the current position # to 5. This automatically updates the tree to this:

并且由于remainder为2 ,我们需要插入两个最终 当前位置的后缀:abb.这主要是因为:

And because remainder is 2, we need to insert two final suffixes of the current position: ab and b. This is basically because:

  • 上一步的a后缀从未正确 已插入.所以它保留了 ,并且自从我们取得了进步以来, 步骤,现在已经从a增长到ab.
  • 我们需要插入新的最终边缘b.
  • The a suffix from the previous step has never been properly inserted. So it has remained, and since we have progressed one step, it has now grown from a to ab.
  • And we need to insert the new final edge b.

在实践中,这意味着我们转到活动点(该点指向 现在abcab边缘的a后面),然后插入 当前的最终字符b. 但是:同样,事实证明b是 在同一条边上也已经存在.

In practice this means that we go to the active point (which points to behind the a on what is now the abcab edge), and insert the current final character b. But: Again, it turns out that b is also already present on that same edge.

因此,同样,我们不更改树.我们简单地:

So, again, we do not change the tree. We simply:

  • 将活动点更新为(root,'a',2)(相同的节点和边 和以前一样,但现在我们指向b)
  • remainder增大到3,因为我们仍然没有正确 插入了上一步的最后一条边,我们没有插入 还是当前的最终优势.
  • Update the active point to (root,'a',2) (same node and edge as before, but now we point to behind the b)
  • Increment the remainder to 3 because we still have not properly inserted the final edge from the previous step, and we don't insert the current final edge either.

要清楚:我们必须在当前步骤中插入abb,但是 因为已经找到ab,所以我们更新了活动点并执行了 甚至不尝试插入b.为什么?因为如果ab在树中, 其中的每个后缀(包括b)必须在树中, 也.也许只是隐式,但是它必须存在,因为 到目前为止我们建立树的方式.

To be clear: We had to insert ab and b in the current step, but because ab was already found, we updated the active point and did not even attempt to insert b. Why? Because if ab is in the tree, every suffix of it (including b) must be in the tree, too. Perhaps only implicitly, but it must be there, because of the way we have built the tree so far.

我们通过增加#继续执行步骤6 .那棵树是 自动更新为:

We proceed to step 6 by incrementing #. The tree is automatically updated to:

由于 remainder为3 ,我们必须插入abxbxx.活动点告诉我们ab的结束位置,因此我们只需要 跳到那里并插入x.确实,x还不存在,所以我们 分割abcabx边缘并插入内部节点:

Because remainder is 3, we have to insert abx, bx and x. The active point tells us where ab ends, so we only need to jump there and insert the x. Indeed, x is not there yet, so we split the abcabx edge and insert an internal node:

边缘表示仍然是文本的指针,因此 可以在O(1)时间内完成内部节点的拆分和插入.

The edge representations are still pointers into the text, so splitting and inserting an internal node can be done in O(1) time.

所以我们处理了abx并将remainder减为2.现在我们 需要插入下一个剩余的后缀bx.但是在我们这样做之前 我们需要更新活动点.分割后的规则 并插入一条边线,以下将称为规则1 ,并且只要 active_node是根(对于其他情况,我们将进一步学习规则3 以下).这是规则1:

So we have dealt with abx and decrement remainder to 2. Now we need to insert the next remaining suffix, bx. But before we do that we need to update the active point. The rule for this, after splitting and inserting an edge, will be called Rule 1 below, and it applies whenever the active_node is root (we will learn rule 3 for other cases further below). Here is rule 1:

从根目录插入后,

After an insertion from root,

  • active_node仍然是根
  • active_edge设置为新后缀的第一个字符 需要插入,即b
  • active_length减少1
  • active_node remains root
  • active_edge is set to the first character of the new suffix we need to insert, i.e. b
  • active_length is reduced by 1

因此,新的活动点三元组(root,'b',1)表示 下一个插入必须在bcabx边上,在1个字符之后, 即b后面.我们可以确定O(1)时间的插入点,然后 检查x是否已经存在.如果存在,我们 会结束当前步骤,并保持原样.但是x 不存在,因此我们通过分割边缘将其插入:

Hence, the new active-point triple (root,'b',1) indicates that the next insert has to be made at the bcabx edge, behind 1 character, i.e. behind b. We can identify the insertion point in O(1) time and check whether x is already present or not. If it was present, we would end the current step and leave everything the way it is. But x is not present, so we insert it by splitting the edge:

同样,这花费了O(1)时间,我们将remainder更新为1,并且 规则1指出(root,'x',0)的有效点.

Again, this took O(1) time and we update remainder to 1 and the active point to (root,'x',0) as rule 1 states.

但是我们还需要做一件事.我们将其称为规则2:

But there is one more thing we need to do. We'll call this Rule 2:

如果我们分割一条边并插入一个新节点,而这不是 在当前步骤中创建的第一个节点,我们将之前的 插入的节点和新节点通过特殊指针后缀 链接.我们将在后面看到为什么这很有用.这就是我们得到的 后缀链接表示为虚线边缘:

If we split an edge and insert a new node, and if that is not the first node created during the current step, we connect the previously inserted node and the new node through a special pointer, a suffix link. We will later see why that is useful. Here is what we get, the suffix link is represented as a dotted edge:

我们仍然需要插入当前步骤的最后一个后缀, x.由于活动节点的active_length组件已下降 为0时,直接在根目录处进行最终插入.由于没有 在以x开头的根节点处的传出边缘,我们插入一个新 边缘:

We still need to insert the final suffix of the current step, x. Since the active_length component of the active node has fallen to 0, the final insert is made at the root directly. Since there is no outgoing edge at the root node starting with x, we insert a new edge:

我们可以看到,在当前步骤中,所有剩余的插入都已制作.

As we can see, in the current step all remaining inserts were made.

我们通过设置# = 7继续执行步骤7 ,该操作会自动附加下一个字符, 像往常一样,将a应用于所有叶子边缘.然后,我们尝试插入新的决赛 字符到活动点(根),并发现它在那里 已经.因此,我们在不插入任何内容的情况下结束了当前步骤,并且 将活动点更新为(root,'a',1).

We proceed to step 7 by setting #=7, which automatically appends the next character, a, to all leaf edges, as always. Then we attempt to insert the new final character to the active point (the root), and find that it is there already. So we end the current step without inserting anything and update the active point to (root,'a',1).

第8步中,# = 8,我们附加了b,如前所示,这仅 意味着我们将活动点更新为(root,'a',2)并增加remainder而不进行任何操作 其他任何事情,因为b已经存在. 但是,我们注意到(在O(1)时间内)活动点 现在处于边缘的尽头.我们通过将其重置为 (node1,'\0x',0).在这里,我使用node1来指代 内部节点,ab边在此结束.

In step 8, #=8, we append b, and as seen before, this only means we update the active point to (root,'a',2) and increment remainder without doing anything else, because b is already present. However, we notice (in O(1) time) that the active point is now at the end of an edge. We reflect this by re-setting it to (node1,'\0x',0). Here, I use node1 to refer to the internal node the ab edge ends at.

然后,在步骤# = 9 中,我们需要插入"c",这将有助于我们 了解最后的把戏:

Then, in step #=9, we need to insert 'c' and this will help us to understand the final trick:

与往常一样,#更新会自动将c附加到叶子边缘 然后转到活动点,看看是否可以插入"c".事实证明 "c"已经存在于该边缘,因此我们将活动点设置为 (node1,'c',1),增加remainder,然后什么也不做.

As always, the # update appends c automatically to the leaf edges and we go to the active point to see if we can insert 'c'. It turns out 'c' exists already at that edge, so we set the active point to (node1,'c',1), increment remainder and do nothing else.

现在在步骤# = 10 中,remainder为4,因此我们首先需要插入 abcd(保留3步之前),方法是在活动目录中插入d 点.

Now in step #=10, remainder is 4, and so we first need to insert abcd (which remains from 3 steps ago) by inserting d at the active point.

尝试在活动点插入d会导致边缘分裂 O(1)时间:

Attempting to insert d at the active point causes an edge split in O(1) time:

从中开始拆分的active_node标记为 红色以上.这是最终规则,规则3:

The active_node, from which the split was initiated, is marked in red above. Here is the final rule, Rule 3:

从不是根的active_node分割一条边后 节点,我们跟随该节点出的后缀链接(如果存在) 任何一个,然后将active_node重置为其所指向的节点.如果有 没有后缀链接,我们将active_node设置为根. active_edgeactive_length保持不变.

After splitting an edge from an active_node that is not the root node, we follow the suffix link going out of that node, if there is any, and reset the active_node to the node it points to. If there is no suffix link, we set the active_node to the root. active_edge and active_length remain unchanged.

因此活动点现在是(node2,'c',1),并且node2被标记为 下面是红色:

So the active point is now (node2,'c',1), and node2 is marked in red below:

由于abcd的插入已完成,因此我们将remainder递减为 3,并考虑当前步骤的下一个剩余后缀, bcd.规则3已将活动点设置为仅正确的节点和边缘 因此插入bcd可以通过简单地插入其最后一个字符来完成 d在活动点.

Since the insertion of abcd is complete, we decrement remainder to 3 and consider the next remaining suffix of the current step, bcd. Rule 3 has set the active point to just the right node and edge so inserting bcd can be done by simply inserting its final character d at the active point.

这样做会导致另一个边缘分裂,并且由于规则2 ,我们 必须创建一个从先前插入的节点到新节点的后缀链接 一个:

Doing this causes another edge split, and because of rule 2, we must create a suffix link from the previously inserted node to the new one:

我们观察到:后缀链接使我们能够重置活动点,因此我们 可以花费O(1)进行下一个剩余插入.看着那(这 上图确认标签ab上的节点确实已链接到 b上的节点(后缀),并且abc上的节点链接到 bc.

We observe: Suffix links enable us to reset the active point so we can make the next remaining insert at O(1) effort. Look at the graph above to confirm that indeed node at label ab is linked to the node at b (its suffix), and the node at abc is linked to bc.

当前步骤尚未完成. remainder现在是2,我们 需要遵循规则3再次重置活动点.自从 当前active_node(上面的红色)没有后缀链接,我们重置为 根.现在的活动点是(root,'c',1).

The current step is not finished yet. remainder is now 2, and we need to follow rule 3 to reset the active point again. Since the current active_node (red above) has no suffix link, we reset to root. The active point is now (root,'c',1).

因此,下一次插入发生在根节点的一个输出边缘上 其标签以c:cabxabcd开头,在第一个字符之后, 即在c之后.这会导致另一个分裂:

Hence the next insert occurs at the one outgoing edge of the root node whose label starts with c: cabxabcd, behind the first character, i.e. behind c. This causes another split:

由于这涉及到创建一个新的内部节点,因此我们遵循 规则2并从先前创建的内部链接中设置一个新的后缀链接 节点:

And since this involves the creation of a new internal node,we follow rule 2 and set a new suffix link from the previously created internal node:

(我正在为这些小工具使用 Graphviz点 图.新的后缀链接导致点重新排列现有的 边缘,因此请仔细检查以确认 上面插入的是新的后缀链接.)

(I am using Graphviz Dot for these little graphs. The new suffix link caused dot to re-arrange the existing edges, so check carefully to confirm that the only thing that was inserted above is a new suffix link.)

这样,可以将remainder设置为1,因为active_node是 根,我们使用规则1将活动点更新为(root,'d',0).这 表示当前步骤的最后插入是插入单个d 根源:

With this, remainder can be set to 1 and since the active_node is root, we use rule 1 to update the active point to (root,'d',0). This means the final insert of the current step is to insert a single d at root:

那是最后一步,我们已经完成. 最终数量 观察,但是:

That was the final step and we are done. There are number of final observations, though:

  • 在每一步中,我们将#向前移动1个位置.这会自动 在O(1)时间内更新所有叶节点.

  • In each step we move # forward by 1 position. This automatically updates all leaf nodes in O(1) time.

但是它不处理a)上一个剩余的后缀 步骤,以及b)带有当前步骤的最后一个字符.

But it does not deal with a) any suffixes remaining from previous steps, and b) with the one final character of the current step.

remainder告诉我们需要多少个附加插入物 制作.这些插入一对一地对应于最后的后缀 在当前位置#结束的字符串.我们考虑一个 在另一个之后,并进行插入. 重要:每次插入 因为活动点告诉我们确切的位置,所以在O(1)时间内完成 前进,我们只需要在活动窗口中添加一个字符 观点.为什么?因为其他字符是隐式包含的 (否则,活动点将不在当前位置).

remainder tells us how many additional inserts we need to make. These inserts correspond one-to-one to the final suffixes of the string that ends at the current position #. We consider one after the other and make the insert. Important: Each insert is done in O(1) time since the active point tells us exactly where to go, and we need to add only one single character at the active point. Why? Because the other characters are contained implicitly (otherwise the active point would not be where it is).

在每次这样的插入之后,我们递减remainder并遵循 后缀链接(如果有).如果没有,我们就扎根(规则3).要是我们 已经在根目录上,我们使用规则1修改活动点. 无论如何,只需要O(1)时间.

After each such insert, we decrement remainder and follow the suffix link if there is any. If not we go to root (rule 3). If we are at root already, we modify the active point using rule 1. In any case, it takes only O(1) time.

如果在这些插入之一中发现我们想要的字符 插入已经在那里,我们什么也不做,结束 当前步,即使> 0.原因是 剩余的插入将是我们刚尝试插入的后缀 制作.因此,它们在当前树中都是隐式.事实 确保> 0可以确保我们处理剩余的后缀 以后.

If, during one of these inserts, we find that the character we want to insert is already there, we don't do anything and end the current step, even if remainder>0. The reason is that any inserts that remain will be suffixes of the one we just tried to make. Hence they are all implicit in the current tree. The fact that remainder>0 makes sure we deal with the remaining suffixes later.

如果在算法> 0结束时该怎么办?这将是 当文本结尾是出现的子字符串时 之前的某个地方.在这种情况下,我们必须追加一个额外的字符 在之前未发生过的字符串的末尾.在里面 文献中,通常将美元符号$用作 那. 这为什么重要?->如果以后我们使用完整的后缀树搜索后缀,则仅当它们以叶子结尾时,我们才必须接受匹配项.否则,我们将得到很多虚假的匹配,因为树中包含许多个字符串隐式,它们不是主字符串的实际后缀.在最后强制remainder为0是一种确保所有后缀都在叶节点处结束的方法. 但是,如果我们要使用树来搜索通用子字符串,不仅要搜索主字符串的后缀,这最后一步的确不是如以下OP的评论所建议的那样.

What if at the end of the algorithm remainder>0? This will be the case whenever the end of the text is a substring that occurred somewhere before. In that case we must append one extra character at the end of the string that has not occurred before. In the literature, usually the dollar sign $ is used as a symbol for that. Why does that matter? --> If later we use the completed suffix tree to search for suffixes, we must accept matches only if they end at a leaf. Otherwise we would get a lot of spurious matches, because there are many strings implicitly contained in the tree that are not actual suffixes of the main string. Forcing remainder to be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. However, if we want to use the tree to search for general substrings, not only suffixes of the main string, this final step is indeed not required, as suggested by the OP's comment below.

那么整个算法的复杂度是多少?如果文字是n 字符长度,显然有n步(如果加,则为n + 1 美元符号).在每一步中,我们要么什么都不做(除了 更新变量),或者我们进行remainder插入,每个插入都使用O(1) 时间.由于remainder表示我们无执行了多少次 在前面的步骤中,对于我们制作的每个插入物都会减少 现在,我们做某事的总次数正好是n(或 n + 1).因此,总复杂度为O(n).

So what is the complexity of the entire algorithm? If the text is n characters in length, there are obviously n steps (or n+1 if we add the dollar sign). In each step we either do nothing (other than updating the variables), or we make remainder inserts, each taking O(1) time. Since remainder indicates how many times we have done nothing in previous steps, and is decremented for every insert that we make now, the total number of times we do something is exactly n (or n+1). Hence, the total complexity is O(n).

但是,有一件我没有正确解释的小事情: 可能发生的是,我们跟随一个后缀链接,更新了活动的 点,然后发现其active_length组件没有 与新的active_node一起很好地工作.例如,考虑一种情况 像这样:

However, there is one small thing that I did not properly explain: It can happen that we follow a suffix link, update the active point, and then find that its active_length component does not work well with the new active_node. For example, consider a situation like this:

(虚线表示树的其余部分.虚线是 后缀链接.)

(The dashed lines indicate the rest of the tree. The dotted line is a suffix link.)

现在让活动点为(red,'d',3),因此它指向该位置 在defg边缘上的f后面.现在假设我们做了必要的 更新,现在单击后缀链接以更新活动点 根据规则3.新的活动点为(green,'d',3).然而, 从绿色节点出来的d -edge是de,所以它只有2个 人物.为了找到正确的活动点,我们显然 需要跟随该边缘到达蓝色节点并重置为(blue,'f',1).

Now let the active point be (red,'d',3), so it points to the place behind the f on the defg edge. Now assume we made the necessary updates and now follow the suffix link to update the active point according to rule 3. The new active point is (green,'d',3). However, the d-edge going out of the green node is de, so it has only 2 characters. In order to find the correct active point, we obviously need to follow that edge to the blue node and reset to (blue,'f',1).

在特别糟糕的情况下,active_length可能与 remainder,它可以和n一样大.这很可能会发生 要找到正确的活动点,我们不仅需要跳过一个 内部节点,但在最坏的情况下可能多达n个.那会吗 意味着该算法具有隐藏的O(n 2 )复杂度,因为 在每个步骤中,remainder通常为O(n),并且后期调整 跟随后缀链接后的活动节点的活动地址也可以是O(n)?

In a particularly bad case, the active_length could be as large as remainder, which can be as large as n. And it might very well happen that to find the correct active point, we need not only jump over one internal node, but perhaps many, up to n in the worst case. Does that mean the algorithm has a hidden O(n2) complexity, because in each step remainder is generally O(n), and the post-adjustments to the active node after following a suffix link could be O(n), too?

不.原因是,如果确实需要调整活动点 (例如,从上面的绿色到蓝色),这将我们带到了一个新的节点, 有自己的后缀链接,并且active_length将被减少.作为 我们跟随后缀链接链,我们完成其余的插入,active_length只能 减少,我们可以进行的活动点调整的次数 该方式在任何给定时间都不能大于active_length.自从 active_length不能大于remainder,并且remainder 不仅在每一步中都是O(n),而且增量的总和是 在整个过程中对remainder做过的 也是O(n),有效点调整的数量也受以下限制 O(n).

No. The reason is that if indeed we have to adjust the active point (e.g. from green to blue as above), that brings us to a new node that has its own suffix link, and active_length will be reduced. As we follow down the chain of suffix links we make the remaining inserts, active_length can only decrease, and the number of active-point adjustments we can make on the way can't be larger than active_length at any given time. Since active_length can never be larger than remainder, and remainder is O(n) not only in every single step, but the total sum of increments ever made to remainder over the course of the entire process is O(n) too, the number of active point adjustments is also bounded by O(n).

这篇关于普通英语的Ukkonen后缀树算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆