Document search on partial words


Problem description

I am looking for a document search engine (such as Xapian, Whoosh, Lucene, Solr, Sphinx, or others) that is capable of searching on partial terms.

For example, when searching for the term "brit", the search engine should return documents containing either "britney" or "britain", or in general any document containing a word that matches the pattern *brit*.

Tangentially, I noticed that most engines use TF-IDF (term frequency-inverse document frequency) or its derivatives, which are based on full terms rather than partial terms. Are there any other techniques besides TF-IDF that have been successfully implemented for document retrieval?

Recommended answer

With Lucene you can implement this in several ways:

1.) You can use a wildcard query such as *brit* (you would have to configure your query parser to allow leading wildcards).

2.) You can create an additional field containing n-grams of all the terms. This results in a larger index, but in many cases searches are faster.

3.) You can use fuzzy search to handle typing mistakes in the query, e.g. someone typed "britnei" but wanted to find "britney".
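
To make the n-gram option more concrete, here is a minimal sketch in plain Python rather than Lucene. It only illustrates the idea of indexing character n-grams so that an infix such as "brit" can be looked up without scanning every term. The names `ngrams` and `NGramIndex` are hypothetical and do not correspond to any Lucene API; the trigram size and the whitespace tokenization are assumptions made for the example.

```python
from collections import defaultdict


def ngrams(term, n=3):
    """Return the set of character n-grams of a term (the whole term if it is shorter than n)."""
    term = term.lower()
    if len(term) <= n:
        return {term}
    return {term[i:i + n] for i in range(len(term) - n + 1)}


class NGramIndex:
    """Toy inverted index mapping character n-grams to document ids (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> ids of documents containing it
        self.docs = {}                    # document id -> original text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for word in text.lower().split():
            for gram in ngrams(word, self.n):
                self.postings[gram].add(doc_id)

    def search(self, fragment):
        fragment = fragment.lower()
        if len(fragment) < self.n:
            # Fragment is too short for an n-gram lookup: fall back to scanning all documents.
            candidates = set(self.docs)
        else:
            # A matching document must contain every n-gram of the fragment.
            grams = ngrams(fragment, self.n)
            candidates = set.intersection(*(self.postings.get(g, set()) for g in grams))
        # The n-grams may come from different words, so verify each candidate.
        return sorted(
            doc_id for doc_id in candidates
            if any(fragment in word for word in self.docs[doc_id].lower().split())
        )


index = NGramIndex()
index.add(1, "Britney Spears tour dates")
index.add(2, "Travel guide to Great Britain")
index.add(3, "Celebrity gossip")
index.add(4, "Weather report")
print(index.search("brit"))
```

Note that document 3 is also returned, because "celebrity" contains "brit" as an infix, which is what the pattern *brit* asks for. The extra postings for every n-gram are the reason the index grows, as mentioned in option 2.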

For wildcard queries and fuzzy search, have a look at the query syntax docs.
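
As a rough illustration of what the wildcard and fuzzy options compute, the following plain-Python sketch emulates both over a small list of terms. It is not Lucene code: `wildcard_match`, `fuzzy_match`, and `edit_distance` are hypothetical helpers, the fuzzy match uses plain Levenshtein distance, and the `max_edits=1` default is an assumption chosen for the example. In Lucene's classic query syntax the corresponding queries would be written `*brit*` and `britnei~`.

```python
from fnmatch import fnmatchcase


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if the characters are equal
            ))
        prev = cur
    return prev[-1]


def wildcard_match(terms, pattern):
    """Return the terms matching a shell-style pattern, where * matches any run of characters."""
    return [t for t in terms if fnmatchcase(t, pattern)]


def fuzzy_match(terms, query, max_edits=1):
    """Return the terms within max_edits edits of the query."""
    return [t for t in terms if edit_distance(t, query) <= max_edits]


terms = ["britain", "britney", "celebrity", "brief"]
print(wildcard_match(terms, "*brit*"))
print(fuzzy_match(terms, "britnei"))
```

Both helpers scan every term, which is fine for a demonstration but is exactly the cost that makes leading wildcards slow on a large index.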
