Aho-Corasick text matching on whole words?


I'm using Aho-Corasick text matching and wonder if it could be altered to match terms instead of characters. In other words, I want terms, rather than characters, to be the basis of matching. As an example:

Search query: "He",

Sentence: "Hello world",

Aho-Corasick will match "He" against the sentence "Hello world", with the match ending at index 2, but I would prefer to have no match. By "terms" I mean whole words rather than characters.

Solution

One way to do this would be to use Aho-Corasick as usual, then do a filtering step where you eliminate all false positives. For example, every time you find a match, you can confirm that the next and previous characters in the input are non-letter characters like spaces or punctuation. That way, you get the speed of the Aho-Corasick lookup, but only consider matches that appear as whole words in the text.
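The filtering step described above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the automaton is a small hand-written Aho-Corasick (not any particular library's API), the names `build_automaton`, `find_all`, `is_boundary`, and `find_whole_words` are made up for this example, matching is case-sensitive, and "non-letter" is interpreted as `str.isalpha()` being false.

```python
from collections import deque

def build_automaton(patterns):
    # Parallel lists indexed by node id: trie edges, failure links, outputs.
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    # Breadth-first pass to fill in failure links and inherited outputs.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_all(text, automaton):
    # Plain Aho-Corasick scan: yields (start, end, pattern), end exclusive.
    goto, fail, out = automaton
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            yield i - len(pat) + 1, i + 1, pat

def is_boundary(text, pos):
    # Assumption: a boundary is the edge of the text or any non-letter.
    return pos < 0 or pos >= len(text) or not text[pos].isalpha()

def find_whole_words(text, patterns):
    # Run the usual search, then drop matches glued to neighbouring letters.
    automaton = build_automaton(patterns)
    return [(s, e, p) for s, e, p in find_all(text, automaton)
            if is_boundary(text, s - 1) and is_boundary(text, e)]

print(find_whole_words("Hello world", ["He"]))
print(find_whole_words("He said hello", ["He", "hello"]))
```

Because only the characters just before and just after each match are inspected, the filter adds constant work per reported match, and multi-word patterns are handled the same way as single words. Whether digits or underscores should count as word characters is a choice left to the caller; changing `is_boundary` is enough to adjust that.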

Hope this helps!
