在文本中搜索单词列表的算法 [英] Algorithm to search for a list of words in a text

查看:99
本文介绍了在文本中搜索单词列表的算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个单词列表,大约只有1000个左右.我想检查输入文本中是否存在该列表中的任何单词.如果是这样,我想知道发生了什么.输入的文本每个都是几百个单词,这些都是来自网络的文本段落-意味着很多来自不同站点.我正在尝试找到最佳的算法.

I have a list of words, fairly small about 1000 or so. I want to check if any of the words in that list occur in an input text. If so I would like know which ones occur. The input text is a few hundred words each and these are text paragraphs from the web - meaning there a lot of them from different sites. I am trying to find the best algorithm for it.

我可以看到两种显而易见的方法-

I can see two obvious ways to do this --

  1. 一种从文本列表中搜索每个单词的蛮力方式.

  1. A brute force way of searching for each word from the list in the text.

从输入文本创建单词的哈希表,然后从哈希表的列表中搜索每个单词.这很快.

Create a hash table of words from the input text and then search for each word from the list in the hash table. This is fast.

有更好的解决方案吗?

Is there a better solution?

我正在使用python,尽管不确定是否会更改算法.

I am using python though I am not sure if that changes the algorithm anyway.

作为对上述解决方案2的优化,我想将生成的哈希表存储到持久性存储(DB)中,这样,如果单词列表发生更改,我可以重新使用哈希表而不必再次创建它.当然,如果输入文本发生更改,我必须生成哈希表.是否可以将哈希表保存到数据库?有什么建议吗?我目前正在将MongoDB用于我的项目,并且只能在其中存储json文档.我是MongoDB的新手,刚刚开始使用它,但仍然不完全了解它的全部潜力.

Also as an optimization to the solution 2 above, I would like to store the hash table generated to persistent storage (DB) so that if the list of words changes I can re-use the hash table without having to create it again. Of course if the input text changes I have to generate the hash table. Is it possible to save a hash table to a DB? Any recommendations? I am currently using MongoDB for my project and I can only store json documents in it. I am a new to MongoDB and have only just started working with it and still do not fully understand the full potential of it.

我已经搜索了SO,并且看到了类似的问题,其中两个提出了一个哈希表,但是我想获得任何指向我所想到的优化的指针.

I have searched SO and see two questions along similar lines and one of them suggests a hash table but I would like to get any pointers towards the optimization I have in mind.

以下是关于SO的先前询问的问题-

Here are the previously asked questions on SO -

是否有一种有效的算法可以执行反向全文搜索?

在另一个大型单词中搜索大量单词列表

我刚刚在SO上发现了另一个问题,即同样的问题.

I just found another question on SO which is about the same problem.

文本中多个单词匹配的算法

我想没有比哈希表更好的解决方案了.但是我真的很想对其进行优化,以使对单词列表的更改可以让我对快速存储的所有文本运行算法.我是否应该更改添加到问题中的标签,使其也包含一些数据库技术?

I guess there is no better solution than a hash table. But I would really like to optimize it so that changes to the word list can let me run the algorithm on all the text I have stored up quickly. Should I change the tags added to the question to also include some database technologies?

推荐答案

有一个 比哈希表更好的解决方案.如果您要在大量文本中搜索一组固定的单词,则使用

There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.

该算法从您要搜索的单词构建状态机,然后通过该状态机运行输入文本,并在找到匹配项时输出匹配项.由于构建状态机需要花费一些时间,因此该算法最适合搜索非常大的文本体.

The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.

您可以对正则表达式执行类似的操作.例如,您可能想在某些文本中找到单词狗",猫",马"和臭鼬".您可以构建一个正则表达式:

You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:

"dog|cat|horse|skunk"

然后在文本上运行正则表达式匹配.如何获得所有匹配项将取决于您的特定正则表达式库,但是它确实起作用.对于非常大的单词列表,您需要编写代码以读取单词并生成正则表达式,但是这样做并不难,而且效果很好.

And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.

但是,正则表达式的结果与Aho-Corasick算法的结果有所不同.例如,如果您要在字符串我的业力吃了你的教条"中搜索单词"dog"和"dogma".正则表达式库搜索将报告发现教条". Aho-Corasick实施将报告在同一位置找到"dog"和"dogma".

There is a difference, though, in the results from a regex and the results from the Aho-Corasick algorithm. For example if you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma." The regex library search will report finding "dogma". The Aho-Corasick implementation will report finding "dog" and "dogma" at the same position.

如果要让Aho-Corasick算法仅报告整个单词,则必须稍作修改.

If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.

正则表达式也将报告部分单词的匹配情况.也就是说,如果您要搜索"dog",它将在"dogma"中找到它.但是您可以修改正则表达式以仅给出整个单词.通常,这是通过\b完成的,例如:

Regex, too, will report matches on partial words. That is, if you're searching for "dog", it will find it in "dogma". But you can modify the regex to only give whole words. Typically, that's done with the \b, as in:

"\b(cat|dog|horse|skunk)\b"

您选择的算法在很大程度上取决于输入文本的大小.如果输入文本不太大,则可以创建要查找的单词的哈希表.然后遍历输入文本,将其分解为单词,然后检查哈希表以查看单词是否在表中.用伪代码:

The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for. Then go through the input text, breaking it into words, and checking the hash table to see if the word is in the table. In pseudo code:

hashTable = Build hash table from target words
for each word in input text
    if word in hashTable then
        output word

或者,如果您想要输入文本中匹配单词的列表:

Or, if you want a list of matching words that are in the input text:

hashTable = Build hash table from target words
foundWords = empty hash table
for each word in input text
    if word in hashTable then
        add word to foundWords

这篇关于在文本中搜索单词列表的算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆