Effective search on a small text

Question

I have many small texts (let's say about 500 words each) and two databases with roughly 10,000 entries (keywords) each.

I now want to process every text and find out which keywords (the ones stored in the two databases) it contains.

Does anyone have a good approach for doing this efficiently?

I was thinking of processing every text and indexing it (perhaps with Lucene) before searching the database keywords against it, but I don't really know whether Lucene is the right tool for this.

Recommended answer

Lucene is exactly the right tool for this task.

One way to achieve your goal is to index each text in a RAMDirectory and then get the TermEnum from the index using the IndexReader. You can then match the enumerated terms against the keywords in your database.
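The core of this first approach is to extract the distinct terms of one text and intersect them with the keyword set. The sketch below shows only that matching logic in plain Python; it does not use Lucene. The regex tokenizer and the names `extract_terms` and `match_keywords` are assumptions made for illustration, and a real Lucene analyzer may tokenize differently (stop words, stemming). It also assumes every keyword is a single word.

```python
import re


def extract_terms(text):
    """Return the set of distinct lowercase word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def match_keywords(text, keywords):
    """Return the keywords that occur as whole terms in the text."""
    terms = extract_terms(text)  # stand-in for enumerating the index terms
    return {kw for kw in keywords if kw.lower() in terms}


if __name__ == "__main__":
    keywords = {"lucene", "index", "database"}  # hypothetical keyword set
    text = "Lucene builds an inverted index over each text."
    print(sorted(match_keywords(text, keywords)))
```

With roughly 500 terms per text, each text costs one set lookup per keyword, so the work grows with the number of keywords rather than with the text length.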

Another approach is to index every text as a Lucene document, then iterate over your keywords and get the termDocs for the current term, which gives you all texts that contain the current term/keyword.
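This second approach amounts to building one inverted index over all texts and looking up each keyword's posting list. Here is a minimal sketch of that idea in plain Python, again as a stand-in for Lucene rather than its API. The names `build_index` and `texts_per_keyword`, the integer text ids, and the regex tokenizer are all assumptions for illustration.

```python
import re
from collections import defaultdict


def build_index(texts):
    """Map each lowercase term to the set of text ids that contain it."""
    index = defaultdict(set)
    for text_id, text in texts.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(text_id)
    return index


def texts_per_keyword(index, keywords):
    """For each keyword, return the ids of all texts containing it."""
    # index.get avoids creating empty entries for unknown keywords
    return {kw: set(index.get(kw.lower(), set())) for kw in keywords}


if __name__ == "__main__":
    texts = {  # hypothetical sample texts keyed by id
        1: "Lucene indexes text",
        2: "A database stores keywords",
        3: "Lucene and a database",
    }
    index = build_index(texts)
    for kw, ids in texts_per_keyword(index, ["lucene", "database"]).items():
        print(kw, sorted(ids))
```

Here the index is built once, so this variant fits best when the same collection of texts is checked against the keyword list repeatedly.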
