Using Aho-Corasick, can strings be added after the initial tree is built?

Question

I want to search for strings inside a large number of documents. I have a predefined list of strings that I want to find in each document. Each document also begins with a header, followed by the text, and the header contains additional strings that I want to search for in the text below it.

For each document, is it possible to add the header strings after the initial tree has been built from the main list? Or can the original data structure be modified to include the new strings?

If this is not practical to do, is there an alternative search method that would be more appropriate?

Recommended answer

If each document has its own set of strings to search for, it seems like you could just build one global Aho-Corasick matcher and then a second, per-document matcher. Then, as you process the characters in the document, feed each into both of the matching automata and report all matches found this way. That eliminates the need to add new strings to the master automaton and to remove them when you're done. Plus, the slowdown should be pretty minimal.
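
The two-matcher idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not a reference implementation: the `AhoCorasick` class, its `step` method, and the `search_document` helper are hypothetical names written for this example, and the header strings are assumed to have already been extracted from each document.

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton over a fixed set of patterns (illustrative)."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # failure link of each state
        self.out = [[]]    # patterns that end at each state
        for pattern in patterns:
            if pattern:
                self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern):
        state = 0
        for ch in pattern:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = nxt
            state = nxt
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 states keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit the patterns that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def step(self, state, ch):
        """Advance from `state` on character `ch` and return the new state."""
        while state and ch not in self.goto[state]:
            state = self.fail[state]
        return self.goto[state].get(ch, 0)


def search_document(text, global_matcher, header_strings):
    """Return (start_index, pattern) for every match from either automaton."""
    # The per-document matcher is built fresh and discarded afterwards,
    # so the global matcher is never modified.
    doc_matcher = AhoCorasick(header_strings)
    matchers = (global_matcher, doc_matcher)
    states = [0, 0]
    matches = []
    for i, ch in enumerate(text):
        # Feed the same character to both automata.
        for k, matcher in enumerate(matchers):
            states[k] = matcher.step(states[k], ch)
            for pattern in matcher.out[states[k]]:
                matches.append((i - len(pattern) + 1, pattern))
    return matches


# Build the global matcher once, then reuse it for every document.
global_matcher = AhoCorasick(["he", "she", "hers"])
result = search_document("ushers", global_matcher, ["us", "her"])
```

In this sketch the global automaton is built once and shared across documents, while each per-document automaton costs time proportional to the total length of that document's header strings. A string that appears in both lists would be reported twice, so deduplicate the results if that matters.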

Hope this helps!
