Searching names with Apache Solr


Problem description


I've just ventured into the seemingly simple but extremely complex world of searching. For an application, I am required to build a search mechanism for searching users by their names.

After reading numerous posts and articles, including:

How can I use Lucene for personal name (first name, last name) search?
http://dublincore.org/documents/1998/02/03/name-representation/
What's the best way to search a social network by prioritizing a users relationships first?
http://www.gossamer-threads.com/lists/lucene/java-user/120417
Lucene Index and Query Design Question - Searching People
Lucene Fuzzy Search for customer names and partial address

... and a few others I cannot find at the moment, and after getting at least indexing and basic search working on my machine, I have devised the following scheme for user searching:

1) Have first, second and third name fields and index them with Solr
2) Use edismax as the query parser for multi-field searching
3) Use a combination of normalization filters such as transliteration, Latin-to-ASCII conversion, etc.
4) Finally, use fuzzy search

Evidently, being very new to this, I am unsure whether the above is the best way to do it, and I would like to hear from experienced users who have better ideas than I do in this field.

I need to be able to match names in the following ways:

1) Accent folding: Jorn matches Jörn and vice versa
2) Alternative spellings: Karl matches Carl and vice versa
3) Shortened representations (I believe I do this with the SynonymFilterFactory): Sue matches Susanne, etc.
4) Levenshtein matching: Jonn matches John, etc.
5) Soundex matching: Elin and Ellen
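
To make requirement 5 concrete, here is a minimal Python sketch of classic American Soundex. It only illustrates the encoding idea; it is not Solr's implementation, and the function name `soundex` is just a label chosen for this example.

```python
def soundex(name: str) -> str:
    """Return the 4-character American Soundex code of a name (illustrative sketch)."""
    # Map each consonant to its Soundex digit; vowels, H, W and Y get no digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    first = letters[0]
    encoded = first
    prev = codes.get(first, "")
    for ch in letters[1:]:
        digit = codes.get(ch, "")
        # Emit a digit only when it differs from the previous letter's digit.
        if digit and digit != prev:
            encoded += digit
        # H and W do not break a run of identical digits; vowels do.
        if ch not in "HW":
            prev = digit
    # Pad with zeros and truncate to one letter plus three digits.
    return (encoded + "000")[:4]


print(soundex("Elin"), soundex("Ellen"))
print(soundex("Carl"), soundex("Karl"))
```

Elin and Ellen receive the same code, so they would match. Carl and Karl do not, because Soundex always keeps the first letter, so requirement 2 needs something other than plain Soundex.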

Any guidance, criticism or comments are very welcome. Please let me know if this is possible ... or perhaps I'm just daydreaming. :)


EDIT

I should add that I also have a fullname field in case some people have long names. To take an example from one of the posts: Jon Paul or Del Carmen should also match Jon Paul Del Carmen.

And since this is a new project, I can modify the schema and architecture any way I see fit, so there are very few restrictions.

Solution

It sounds like you are catering for a corpus where searches need to match very loosely?

If so, you will want to choose your fields carefully and set different boosts on them to rank your results.

So have separate "copied" fields in Solr:

• one field for the exact full name (with filters)
• a multivalued field with the filters ASCIIFolding, Lowercase...
• a multivalued field with the SynonymFilterFactory, ASCIIFolding, Lowercase...
• a field with the PhoneticFilterFactory (with Caverphone or Double-Metaphone)
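
As a rough illustration of ranking across such copied fields, the Python sketch below builds the query string for an edismax request with per-field boosts in `qf`. The field names (`fullname_exact`, `name_folded`, `name_synonym`, `name_phonetic`) and the boost values are assumptions made up for this example; `defType`, `qf` and `fl` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_name_query(user_input: str) -> str:
    """Build an edismax query string that weights stricter fields higher."""
    params = {
        "q": user_input,
        "defType": "edismax",
        # Hypothetical field names; the boosts are only starting points to tune.
        "qf": "fullname_exact^10 name_folded^5 name_synonym^3 name_phonetic^1",
        "fl": "id,fullname,score",
    }
    return urlencode(params)


# The result is appended to the select handler URL of your own core.
print(build_name_query("Jon Paul"))
```

The idea is that an exact full-name hit outranks a folded hit, which in turn outranks a synonym or phonetic hit, while the looser fields still bring in candidates.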

See also: more non-English Soundex discussion

Synonyms for names: I don't know whether a public synonym database is available.
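
Without a public database, a hand-maintained list is one option. The sketch below reads lines in the comma-separated "equivalent terms" style that Solr synonym files use and expands a query term into its group. The name groups are invented for illustration, and the helpers `load_synonym_groups` and `expand` are hypothetical, not part of Solr.

```python
def load_synonym_groups(lines):
    """Index every term of each comma-separated group to the whole group."""
    index = {}
    for line in lines:
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Skip blank lines and explicit "a => b" mappings in this sketch.
        if not line or "=>" in line:
            continue
        group = {term.strip().lower() for term in line.split(",") if term.strip()}
        for term in group:
            index.setdefault(term, set()).update(group)
    return index


def expand(term, index):
    """Return the term together with its known equivalents, sorted."""
    key = term.lower()
    return sorted(index.get(key, {key}))


# Illustrative entries only; a real list has to be curated for your users.
groups = load_synonym_groups(["sue, susanne, susan", "# nicknames", "bill, william"])
print(expand("Sue", groups))
```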

Fuzzy searching: I've not found it useful. It uses Levenshtein distance.
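
For reference, Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. A minimal Python sketch of the textbook dynamic-programming version (not Lucene's automaton-based implementation) shows how the example names from the question score:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, computed one row at a time."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


for pair in (("Jonn", "John"), ("Carl", "Karl"), ("Elin", "Ellen")):
    print(pair, levenshtein(*pair))
```

In query syntax a fuzzy term is written with a trailing tilde, such as `Jonn~1`. Since recent Lucene versions cap fuzzy matching at two edits, a pair like Sue and Susanne is out of reach for fuzzy search and has to be covered by synonyms.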

Other filters and indexing strategies give more relevant search results.

Unicode characters in names can be handled with the ASCIIFoldingFilterFactory.
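
To see roughly what accent folding does, here is a small Python approximation using Unicode decomposition. It is only a sketch: it strips combining marks and lowercases, whereas ASCIIFoldingFilterFactory applies its own mapping table that also covers letters with no decomposition (for example ø), and lowercasing in Solr is a separate filter.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate accent folding: decompose, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Both spellings fold to the same index term.
print(fold_accents("Jörn"), fold_accents("Jorn"))
```

Applying the same folding at index time and at query time is what makes Jorn match Jörn in both directions.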

You are describing solutions up front for expected use cases.

If you want quality results, plan on tuning your search relevance.

This tuning will be especially valuable when attempting to match variant spellings such as MacDonald and McDonald, which, like Carl and Karl, differ by a single edit.
