Document search on partial words
Question
I am looking for a document search engine (such as Xapian, Whoosh, Lucene, Solr, Sphinx or others) that is capable of searching on partial terms.
For example, when searching for the term "brit", the search engine should return documents containing either "britney" or "britain", or in general any document containing a word matching *brit*
Tangentially, I noticed that most engines use TF-IDF (term frequency-inverse document frequency) or its derivatives, which are based on full terms rather than partial terms. Are there any other techniques besides TF-IDF that have been successfully implemented for document retrieval?
Recommended answer
With Lucene you would be able to implement this in several ways:
1.) You can use wildcard queries such as *brit* (you would have to set your query parser to allow leading wildcards).
2.) You can create an additional field containing N-grams of all the terms. This results in a larger index, but in many cases searches are faster.
3.) You can use fuzzy search to handle typing mistakes in the query, e.g. someone typed britnei but wanted to find britney.
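To make option 2 more concrete, here is a minimal Python sketch of the N-gram idea. It is not Lucene code: the class NGramIndex, the helper ngrams, and the choice of 3-character grams are assumptions made purely for illustration. Every word is broken into character trigrams at index time, and a query fragment is answered by intersecting the posting lists of its own trigrams and then verifying the surviving candidates.

```python
from collections import defaultdict


def ngrams(term, n=3):
    """Return the set of character n-grams of a term (the whole term if it is no longer than n)."""
    term = term.lower()
    if len(term) <= n:
        return {term}
    return {term[i:i + n] for i in range(len(term) - n + 1)}


class NGramIndex:
    """Toy inverted index from character n-grams to document ids (hypothetical helper)."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> ids of documents containing it
        self.docs = {}                    # doc id -> original text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for word in text.lower().split():
            for gram in ngrams(word, self.n):
                self.postings[gram].add(doc_id)

    def _contains(self, doc_id, fragment):
        # Verification step: some word of the document really contains the fragment.
        return any(fragment in word for word in self.docs[doc_id].lower().split())

    def search(self, fragment):
        fragment = fragment.lower()
        if len(fragment) < self.n:
            # Fragment too short to form an n-gram: fall back to scanning every document.
            candidates = set(self.docs)
        else:
            # A matching document must contain every n-gram of the fragment.
            gram_sets = [self.postings.get(g, set()) for g in ngrams(fragment, self.n)]
            candidates = set.intersection(*gram_sets)
        return sorted(d for d in candidates if self._contains(d, fragment))


index = NGramIndex()
index.add(1, "Britney Spears tour")
index.add(2, "Great Britain travel")
index.add(3, "celebrity news")
index.add(4, "bright ideas")
print(index.search("brit"))
```

Document 3 is returned as well, because "celebrity" contains "brit" in the middle of the word, which is the *brit* behaviour asked for in the question. The larger index mentioned above shows up here as one posting entry per distinct trigram of every word instead of one per word.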
For wildcard queries and fuzzy search, have a look at the query syntax docs.
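As a rough illustration of what fuzzy search does under the hood, the following Python sketch matches a misspelled query against a small vocabulary using Levenshtein edit distance. The function names edit_distance and fuzzy_match and the max_edits parameter are hypothetical and chosen for this example; real engines use more efficient term-matching machinery, so treat this only as a sketch of the idea.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_match(query, vocabulary, max_edits=1):
    """Return the vocabulary terms within max_edits edits of the query (hypothetical helper)."""
    query = query.lower()
    return sorted(t for t in vocabulary if edit_distance(query, t.lower()) <= max_edits)


print(fuzzy_match("britnei", ["britney", "britain", "bright"]))
```

With max_edits=1, only "britney" is within reach of the typo "britnei", since a single substitution (i to y) turns one into the other.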