Lucene Index and Query Design Question - Searching People

Question

I have recently started working with Lucene (specifically, Lucene.Net) and have successfully created several indexes without any problems. Having previously worked with Endeca, I find Lucene lightweight, powerful, and much easier to learn (mostly thanks to its concise API).

However, I have one specific index/query situation that I am having trouble wrapping my head around. What I have is a person directory. People can be searched for in this application, with the goal of returning both exact and approximate matches. Right now, in the index I concatenate "FirstName" and "LastName" into a single field called "FullName", adding a space between the two. So FirstName:Jon with LastName:Smith yields FullName:Jon Smith. I do anticipate middle names and possibly suffixes, but that is not important at the moment.

I would like to do the equivalent of a fuzzy search on the name, so someone searching for "John Smith" would still get back "Jon Smith". I had thought about a multisearch; however, this becomes more involved if his name is actually "Jon Del Carmen" or "Jon Paul Del Carmen". Nothing in what the user types delineates the first name from the last name.

The only thought that I have is that I could replace spaces in the concatenated value with a character that would not be discarded. If I did this when I built the document for the index and also when I parsed the query, I could treat it as one larger word, right? Is there another way to do this that would work for both simple names ("Jon Smith") and also more complex names ("Jon Paul Del Carmen")?

Any advice would truly be appreciated. Thanks in advance!

Edit: additional details below.

In Luke, I put in the following query:

FullName:jonn smith~

It was parsed as:

FullName:jonn CreatedOn:smith~0.5

Explain:

BooleanQuery:boost=1.0000
    clauses=2, maxClauses=1024
    Clause 0: SHOULD
        TermQuery:boost=1.0000
            Term: field='FullName' text='jonn'
    Clause 1: SHOULD
        FuzzyQuery: boost=1.0000
            prefixLen=0, minSimilarity=0.5000
            org.apache.lucene.search.FuzzyTermEnum: diff=-1.0000
            FilteredTermEnum: Exception null

"CreatedOn" is another field in the index. I tried putting quotes around the term "jonn smith", but it is then treated as a phrase query instead. I am sure the problem is that I am just not doing something right, but being so green at all of this, I am not sure what that something is.
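One plausible reading of the parsed output above is that the query string is split on whitespace before the field prefix and the `~` operator are applied, so `FullName:` binds only to `jonn`, while `smith~` falls through to the default field (which appears to be CreatedOn here) and is the only fuzzy clause. The Python below is a toy model of that behavior, not Lucene's actual query parser; `toy_parse` and its `default_field` parameter are hypothetical names used only for illustration.

```python
def toy_parse(query, default_field):
    """Toy model of whitespace-first parsing; NOT Lucene's real QueryParser."""
    clauses = []
    for chunk in query.split():
        # A "field:" prefix binds only to the chunk it is attached to.
        field, sep, text = chunk.partition(":")
        if not sep:
            # No prefix on this chunk, so it goes to the default field.
            field, text = default_field, chunk
        # A trailing "~" marks only this chunk as fuzzy.
        kind = "fuzzy" if text.endswith("~") else "term"
        clauses.append((field, text.rstrip("~"), kind))
    return clauses


# Assumption: CreatedOn is the default field, as the Luke output suggests.
parsed = toy_parse("FullName:jonn smith~", default_field="CreatedOn")
print(parsed)
```

Under this model the query yields one plain term clause on FullName and one fuzzy clause on CreatedOn, which matches the two SHOULD clauses in the Explain output.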

Recommended answer

My problem was with how I was building the index. What I ended up doing was making sure that FullName was not tokenized, and the query started returning the correct results. The Explain results above were due to an ID10T error on my part, and the query is now returning correctly.
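To illustrate why leaving FullName untokenized can help, here is a minimal Python sketch (not Lucene.Net code) that runs a whole-string fuzzy lookup against two hypothetical term lists: one as a tokenizing analyzer might produce, and one holding each full name as a single term. The scoring formula, `1 - distance / length of the shorter string`, is an assumption meant to approximate classic fuzzy matching. The sketch also assumes names are lowercased at both index and query time. All function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(query, term):
    # Assumed scoring: 1 - distance / length of the shorter string.
    return 1.0 - edit_distance(query, term) / min(len(query), len(term))


def fuzzy_lookup(query, terms, min_similarity=0.5):
    """Return the terms whose similarity to the query meets the threshold."""
    return [t for t in terms if similarity(query, t) >= min_similarity]


# Hypothetical term lists for two people: Jon Smith and Jon Paul Del Carmen.
tokenized_terms = ["jon", "smith", "paul", "del", "carmen"]
untokenized_terms = ["jon smith", "jon paul del carmen"]

query = "jonn smith"
print(fuzzy_lookup(query, tokenized_terms))    # no single token is close enough
print(fuzzy_lookup(query, untokenized_terms))  # the full-name term matches
```

With the tokenized list, the whole query string is far from every individual token, so nothing passes the 0.5 threshold. With the untokenized list, "jon smith" is one edit away and matches. A long name such as "jon paul del carmen" would likewise need to be searched as one string, and the query's similarity to it depends on how much of the name the user typed.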
