Updating an Aho-Corasick trie in the face of inserts and deletes


Problem description


All the literature and implementations I've found of Aho-Corasick are about building the entire trie beforehand from a set of phrases. However, I'm interested in ways to work with it as a mutable data structure, where it can handle occasional adds and removes without needing to rebuild the whole trie (imagine there are 1 million entries in it). It's OK if the worst case is awful, as long as the average case is close to logarithmic.

From how I figure it, the fail state for each node is another node that uses the same symbol. So if we have a hash multimap from each symbol to the list of nodes which use that symbol, we have our candidates whose fail states need updating.
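A minimal sketch of that bookkeeping in Python. All the names here (`Node`, `nodes_by_symbol`, `attach`) are illustrative assumptions, not identifiers from the answer's actual code:

```python
from collections import defaultdict

class Node:
    """Illustrative trie node; field names are assumptions."""
    def __init__(self, symbol=None, parent=None):
        self.symbol = symbol      # edge label leading into this node
        self.parent = parent
        self.children = {}        # symbol -> child Node
        self.fail = None          # fail-state pointer
        self.outputs = set()      # phrases ending exactly here

# The hash multimap: for each symbol, every node entered via that symbol.
# When a node with symbol s is added or removed, nodes_by_symbol[s] is
# exactly the candidate set whose fail states may need updating.
nodes_by_symbol = defaultdict(set)

def attach(parent, symbol):
    """Create a child node and register it in the multimap."""
    node = Node(symbol, parent)
    parent.children[symbol] = node
    nodes_by_symbol[symbol].add(node)
    return node
```

The multimap costs one extra set entry per node but turns "find everything that might fail to this symbol" into a single lookup.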

Remove is straightforward. You find all other nodes which use the deleted node as a fail state and recompute their fail state. Go from the end of the string backwards, and the tree should still be in good shape.

Add is a bit trickier. Any node that failed to that symbol could have the new node as a better candidate. However, again it seems to be enough to traverse every other node with that symbol and fully recompute its fail state.

In other words, if we're adding or removing a node with symbol "A", we need to visit every other "A" node anywhere in the trie, and recompute the fail state (look for its closest ancestor with "A" as a child, or the root). This would require visiting every node for symbol "A", but in my case, that would be several orders of magnitude less than visiting every node in the trie.

Would that algorithm work, or am I missing something obvious?

Solution

I went ahead and implemented it and it seems to be working. A somewhat nicer description of the algorithm, including pictures: https://burningmime.gitlab.io/setmatch/implementation-overview.html#supporting-add-remove-in-an-aho-corasick-trie

For posterity (and per StackOverflow policy), I've copied it here:

It supports modification (add/remove) instead of being built all at once from a pre-set dictionary. This isn't mentioned in any literature about it, and all the open-source implementations I've seen have followed suit in that respect. It turns out that supporting add and remove operations in an AC automaton is fairly trivial. For deep tries over a small alphabet it would not be very time-efficient, but luckily we have a shallow trie over a large alphabet.

We store a hash multimap of each token to every node that uses that token. When we remove a phrase, we start from the last (bottom-most) node. We remove the pointer to the phrase from this bottom-most node. Then, if it has any children, we can't delete any more. Otherwise, go through each other node that uses this node as a fail state and recompute its fail state. Finally, delete the node, then go up to the node's parent and do the same process. We'll eventually reach either a node that has another phrase output, a child, or the root node.
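The bottom-up removal described above might be sketched as follows (a self-contained Python sketch with assumed names; `recompute_fail` applies the standard Aho-Corasick rule of walking the parent's fail chain):

```python
from collections import defaultdict

class Node:
    # Minimal illustrative node; names are assumptions, not the
    # answer's actual identifiers.
    def __init__(self, symbol=None, parent=None):
        self.symbol, self.parent = symbol, parent
        self.children, self.fail, self.outputs = {}, None, set()

nodes_by_symbol = defaultdict(set)  # symbol -> nodes entered via that symbol

def recompute_fail(node, root):
    # Nearest state on the parent's fail chain with a child on this
    # node's symbol; the root if none exists.
    state = node.parent.fail
    while state is not None and node.symbol not in state.children:
        state = state.fail
    node.fail = state.children[node.symbol] if state is not None else root

def remove_phrase(last_node, phrase, root):
    """Delete a phrase bottom-up: prune nodes with no children and no
    remaining outputs, repairing the fail state of every node that
    pointed at a pruned node."""
    node = last_node
    node.outputs.discard(phrase)
    while node is not root and not node.children and not node.outputs:
        parent = node.parent
        del parent.children[node.symbol]          # unlink first, so the
        nodes_by_symbol[node.symbol].discard(node)  # dead node can't be found
        # any node failing to the deleted node gets a new fail state
        for dependent in list(nodes_by_symbol[node.symbol]):
            if dependent.fail is node:
                recompute_fail(dependent, root)
        node = parent
```

Note the node is unlinked from its parent before the dependents are recomputed, so the recomputation cannot pick the dying node again.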

That's kind of difficult to picture, so consider this trie (stolen from the Wikipedia article). We're going to remove the string caa (light gray color), which will require that the fail states in yellow be updated:

The result:

Note there are cases where a node's fail state might be updated 2 or more times in the same remove operation, but will eventually be correct. This could be avoided by removing everything first, then going back through tokens, but the code is complicated enough as it is, and that would only provide a performance benefit in certain awkward circumstances, while being a performance hit in the average case. We just assume that strings containing the same token more than once are rare.

Adding a node is slightly more difficult, but can make use of the same hash multimap structure. Each time a node is added, every other node in the trie that uses the same token and is at a lower depth than the added node could need to have its fail state updated to the new node.
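A sketch of that add path, using the same assumed scaffolding (illustrative names, not the answer's code). After linking the new node and computing its own fail state, every other node on the same token is recomputed, since the new node may now be a better candidate:

```python
from collections import defaultdict

class Node:
    # Minimal illustrative node; names are assumptions.
    def __init__(self, symbol=None, parent=None):
        self.symbol, self.parent = symbol, parent
        self.children, self.fail, self.outputs = {}, None, set()

nodes_by_symbol = defaultdict(set)  # symbol -> nodes entered via that symbol

def recompute_fail(node, root):
    # Standard rule: nearest state on the parent's fail chain that has
    # a child on this symbol; the root if none does.
    state = node.parent.fail
    while state is not None and node.symbol not in state.children:
        state = state.fail
    node.fail = state.children[node.symbol] if state is not None else root

def add_node(parent, symbol, root):
    """Attach a new node, compute its own fail state, then re-check
    every other node sharing the token: the new node may be a better
    (deeper) fail candidate for them."""
    node = Node(symbol, parent)
    parent.children[symbol] = node
    recompute_fail(node, root)
    for other in list(nodes_by_symbol[symbol]):
        recompute_fail(other, root)
    nodes_by_symbol[symbol].add(node)
    return node
```

The loop over `nodes_by_symbol[symbol]` is the unconditional "recompute everyone with this token" strategy described in the next paragraph; a depth check could skip roughly half of them.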

To help picture that, imagine going from the second diagram back to the first. The same two fail states are the ones that would need to be updated if we added the string caa back into that trie, and since we recompute every c and a node, it's not a problem.

We could keep track of the node depths to skip over about half, but right now it's just recomputing the fail state for every node that shares the token to save memory. This means that tries where the tokens appear many times will be slow to add to, but again that's considered a rare enough situation. It would be a concern if you were trying to adapt this technique to something like a DNA sequence or antivirus database where there is a small alphabet.

The time complexity depends on how many nodes share symbols and how deep the trie is, meaning it's worst-case O(N), but in practice a fairly small number (average-case is roughly O(log^2 N)).

Incredibly messy and mostly uncommented code, but if you're curious, here's the header, the body, and some tests.
