Effective search on a small text
Question
I have many small texts (let's say about 500 words each) and two databases with roughly 10,000 entries (keywords) each.
I now want to process every text and find out which keywords (the ones stored in the two databases) it contains.
Does anyone have a good approach for doing this efficiently?
My idea was to process and index every text (perhaps with Lucene) and then run the database keywords against that index, but I don't really know whether Lucene is the right tool for this.
Recommended answer
Lucene is exactly the right tool for this task.
One way to achieve your goal is to index each text in a RAMDirectory and then get a TermEnum from that index through an IndexReader. You can then match the enumerated terms against the keywords in your database.
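The idea behind this first approach can be sketched without Lucene at all: reduce one text to its set of distinct terms, then intersect that set with the keyword set. The sketch below is plain Java, not Lucene's API. The class `TermSetMatcher`, its methods, and the tokenizer (lowercase, split on anything that is not a letter or digit) are hypothetical stand-ins for what an analyzer and a term enumeration would give you, and it assumes keywords are single lowercase words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java stand-in for "index one text, enumerate its terms, match keywords".
// Not Lucene code; all names here are illustrative assumptions.
final class TermSetMatcher {

    // Hypothetical tokenizer: lowercases the text and splits on any
    // character that is not a letter or a digit.
    static Set<String> terms(String text) {
        Set<String> result = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Returns the keywords that occur as terms in the text.
    // Assumes every keyword is a single lowercase word.
    static Set<String> matchKeywords(String text, Set<String> keywords) {
        Set<String> found = terms(text);
        found.retainAll(keywords); // keep only terms that are also keywords
        return found;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("lucene", "index", "java"));
        System.out.println(matchKeywords("Lucene can index small texts.", keywords));
    }
}
```

With about 500 words per text and a hash-based keyword set, each text costs roughly one lookup per distinct term, so the number of keywords barely affects the running time.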
Another approach is to index every text as a Lucene document, then iterate over your keywords and get the termDocs for each one. That gives you, for every keyword, all texts that contain it.
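This second approach is a lookup in an inverted index: every term maps to the documents that contain it, and each keyword is looked up once. The sketch below shows that structure in plain Java. It is not Lucene's API: `InvertedIndex`, `add`, `docsFor`, and `matchAll` are hypothetical names, the integer document IDs are assigned by the caller, and the tokenizer is the same simplistic assumption (lowercase, split on non-letters and non-digits).

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java stand-in for "index all texts, then look up each keyword".
// Not Lucene code; all names here are illustrative assumptions.
final class InvertedIndex {

    // term -> IDs of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Adds one text under a caller-chosen document ID.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the IDs of all documents containing the keyword (empty if none).
    Set<Integer> docsFor(String keyword) {
        Set<Integer> docs = postings.get(keyword.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<Integer>emptySet()
                            : Collections.unmodifiableSet(docs);
    }

    // Looks up every keyword once and keeps only those that occur somewhere.
    Map<String, Set<Integer>> matchAll(Collection<String> keywords) {
        Map<String, Set<Integer>> result = new TreeMap<>();
        for (String keyword : keywords) {
            Set<Integer> docs = docsFor(keyword);
            if (!docs.isEmpty()) {
                result.put(keyword, docs);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(0, "Lucene can index small texts.");
        index.add(1, "Java has a rich standard library.");
        System.out.println(index.matchAll(Arrays.asList("lucene", "java", "python")));
    }
}
```

The trade-off between the two sketches mirrors the one in the answer: the first does one pass per text and suits texts that arrive one at a time, while the second builds the index once and then answers each keyword with a single lookup across all texts.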