Searching names with Apache Solr

Problem description

I've just ventured into the seemingly simple but extremely complex world of searching. For an application, I am required to build a search mechanism for searching users by their names.

After reading numerous posts and articles including:

How can I use Lucene for personal name (first name, last name) search?
http://dublincore.org/documents/1998/02/03/name-representation/
what's the best way to search a social network by prioritizing a users relationships first?
http://www.gossamer-threads.com/lists/lucene/java-user/120417
Lucene Index and Query Design Question - Searching People
Lucene Fuzzy Search for customer names and partial address

... and a few others I cannot find at the moment, and after getting at least indexing and basic search working on my machine, I have devised the following scheme for user searching:

1) Have a first, second and third name field and index those with Solr
2) Use edismax as the query parser for multi-field searching
3) Use a combination of normalization filters such as: transliteration, Latin-to-ASCII conversion, etc.
4) Finally use fuzzy search

Being very new to this, I am unsure whether the above is the best way to do it, and I would like to hear from experienced users who know this field better than I do.

I need to be able to match names in the following ways:

1) Accent folding: Jorn matches Jörn and vice versa
2) Alternative spellings: Karl matches Carl and vice versa
3) Shortened representations (I believe I do this with the SynonymFilterFactory): Sue matches Susanne, etc.
4) Levenshtein matching: Jonn matches John, etc.
5) Soundex matching: Elin and Ellen

Any guidance, criticisms or comments are very welcome. Please let me know if this is possible ... or perhaps I'm just day-dreaming. :)


EDIT

I should add that I also have a fullname field in case some people have long names. As an example from one of the posts: Jon Paul or Del Carmen should also match Jon Paul Del Carmen.

Since this is a new project, I can modify the schema and architecture any way I see fit, so there are very few restrictions.

Solution

It sounds like you are catering for a corpus with searches that you need to match very loosely?

If you are doing that, you will want to choose your fields and set different boosts on them to rank your results.

So have separate "copied" fields in Solr:

  • one field for the exact full name (with filters)
  • a multivalued field with the filters ASCIIFolding, Lowercase...
  • a multivalued field with the SynonymFilterFactory, ASCIIFolding, Lowercase...
  • a field with the PhoneticFilterFactory (with Caverphone or Double-Metaphone)
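
As a rough illustration of how boosts over such fields could be passed to edismax, here is a minimal Python sketch that only builds the request URL. The core name `users`, the host and port, the field names (`name_exact`, `name_folded`, `name_synonym`, `name_phonetic`) and the boost values are all assumptions made up for this example; `q`, `defType` and `qf` are the standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_name_query(user_input,
                     base_url="http://localhost:8983/solr/users/select"):
    """Build an edismax request URL that searches several name fields.

    The base URL and every field name below are hypothetical; adjust
    them to match your own core and schema.
    """
    params = {
        "q": user_input,
        "defType": "edismax",
        # Stricter fields get higher boosts so exact hits rank first.
        "qf": "name_exact^10 name_folded^5 name_synonym^3 name_phonetic",
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_name_query("Jon Paul"))
```

The idea is that a document matching in the exact field outranks one that only matches phonetically. The actual boost values would need to be tuned against real queries.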

See also: more non-English Soundex discussion
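
To give a feel for what a phonetic encoder does, here is a small Python sketch of classic American Soundex. Note that this is a swapped-in, simpler algorithm than the Caverphone or Double-Metaphone encoders mentioned above, and it is not Solr's implementation. It only shows why two differently spelled names can end up with the same key.

```python
# Consonant groups used by American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character Soundex key for an ASCII name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code counts again
    return (result + "000")[:4]


if __name__ == "__main__":
    print(soundex("Elin"), soundex("Ellen"))
```

Elin and Ellen share a key here, but Karl and Carl do not, because Soundex keeps the first letter as-is. That is one reason a phonetic field alone does not cover alternative spellings.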

As for synonyms for names, I don't know whether a public synonym database is available.
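
Without a public list you would probably maintain your own. The Python sketch below assumes a hand-written list in the comma-separated "equivalent terms" style that Solr synonym files use, and shows how such groups expand a name. The entries are just the examples from this thread, and the helper names are hypothetical.

```python
# Hypothetical hand-maintained synonym groups, one group per line.
SYNONYM_LINES = [
    "sue, susanne",
    "karl, carl",
    "macdonald, mcdonald",
]


def build_synonym_map(lines):
    """Map every term to the full set of terms in its group."""
    lookup = {}
    for line in lines:
        group = {t.strip().lower() for t in line.split(",") if t.strip()}
        for term in group:
            lookup.setdefault(term, set()).update(group)
    return lookup


def expand(term, lookup):
    """Return the sorted list of equivalent names, including the term."""
    key = term.lower()
    return sorted(lookup.get(key, {key}))


if __name__ == "__main__":
    names = build_synonym_map(SYNONYM_LINES)
    print(expand("Sue", names))
    print(expand("Jon", names))
```

A name with no entry expands only to itself, so the quality of this kind of matching depends entirely on how complete the list is.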

As for fuzzy searching, I have not found it useful. It is based on Levenshtein distance.
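
For reference, Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. The following is a plain Python sketch of the textbook dynamic-programming version, not the automaton-based matching Lucene uses internally.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("Jonn", "John"))  # a typo, one edit away
    print(levenshtein("John", "Joan"))  # a different name, also one edit
```

Both pairs are one edit apart, which illustrates the weakness: edit distance cannot tell a misspelling from a different name.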

The other filters and indexing approaches give more relevant search results.

Unicode characters in names can be handled with the ASCIIFoldingFilterFactory.
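
To see what folding does to a name, here is a rough Python approximation using only the standard library. It decomposes characters and drops the combining marks, which covers cases like Jörn, but it is only a sketch: the Solr filter also maps characters that have no decomposition (for example ø or ß), which this code leaves untouched.

```python
import unicodedata


def fold(name):
    """Lowercase a name and strip accents via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining marks (accents, umlauts) have a non-zero combining class.
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return stripped.lower()


if __name__ == "__main__":
    print(fold("Jörn") == fold("Jorn"))
```

Because the same folding is applied to both the indexed name and the query, Jorn finds Jörn and Jörn finds Jorn.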

You are describing solutions up front for expected use cases.

If you want quality results, plan on tuning your search relevance.

This tuning will be especially valuable when attempting to match on synonyms like MacDonald and McDonald (which, like Carl and Karl, are only a single edit apart in Levenshtein terms).
