ElasticSearch - Searching For Human Names


Problem description

I have a large database of names, primarily from Scotland. We're currently building a prototype to replace the existing software that carries out the search. That software is still in production, and we're aiming to get our results as close as possible to its current results for the same search.

I was hoping someone could help me out. I am running a search in Elastic Search with the query "Michael Heaney", and I get some wild results. The current software returns two main surnames, "Heaney" and "Heavey", all with the forename "Michael". I can get the "Heaney" results in Elastic Search, but I can't get "Heavey". ES also returns people without the forename "Michael", although I appreciate that this is because the forename is only one part of the fuzzy query. I know this is a narrow use case, since it's only one search, but getting this result and understanding how to obtain it will help.

Thanks.

Mappings

{
   "jr": {
    "_all": {
        "enabled": true,
        "index_analyzer": "index_analyzer",
        "search_analyzer": "search_analyzer"
    },
    "properties": {
        "pty_forename": {
            "type": "string",
            "index": "analyzed",
            "boost": 2,
            "index_analyzer": "index_analyzer",
            "search_analyzer": "search_analyzer",
            "store": "yes"
        },
        "pty_full_name": {
            "type": "string",
            "index": "analyzed",
            "boost": 4,
            "index_analyzer": "index_analyzer",
            "search_analyzer": "search_analyzer",
            "store": "yes"
        },
        "pty_surname": {
            "type": "string",
            "index": "analyzed",
            "boost": 4,
            "index_analyzer": "index_analyzer",
            "search_analyzer": "search_analyzer",
            "store": "yes"
        }
     }
   }
}

Index settings

{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0,
    "analysis": {
        "analyzer": {
            "index_analyzer": {
                "tokenizer": "standard",
                "filter": [
                    "standard",
                    "my_delimiter",
                    "lowercase",
                    "stop",
                    "asciifolding",
                    "porter_stem",
                    "my_metaphone"
                ]
            },
            "search_analyzer": {
                "tokenizer": "standard",
                "filter": [
                    "standard",
                    "my_metaphone",
                    "synonym",
                    "lowercase",
                    "stop",
                    "asciifolding",
                    "porter_stem"
                ]
            }
        },
        "filter": {
            "synonym": {
                "type": "synonym",
                "synonyms_path": "synonyms/synonyms.txt"
            },
            "my_delimiter": {
                "type": "word_delimiter",
                "generate_word_parts": true,
                "catenate_words": false,
                "catenate_numbers": false,
                "catenate_all": false,
                "split_on_case_change": false,
                "preserve_original": false,
                "split_on_numerics": false,
                "stem_english_possessive": false
            },
            "my_metaphone": {
                "type": "phonetic",
                "encoder": "metaphone",
                "replace": false
            }
        }
     }
   }
}

Fuzzy query

{
"from":0, "size":100,
"query": {
    "bool": {
        "should": [
            {
                "fuzzy": {
                    "pty_surname": {
                        "min_similarity": 0.2,
                        "value": "Heaney",
                        "prefix_length": 0,
                        "boost": 5
                    }
                }
            },
            {
                "fuzzy": {
                    "pty_forename": {
                        "min_similarity": 1,
                        "value": "Michael",
                        "prefix_length": 0,
                        "boost": 1
                    }
                }
            }
        ]
     }
  }
}

Recommended answer

First, I recreated your current configuration in Play: https://www.found.no/play/gist/867785a709b4869c5543

If you go there, switch to the "Analysis" tab to see how the text is transformed:

Note, for example, that Heaney ends up tokenized as [hn, heanei] with the search_analyzer and as [HN, heanei] with the index_analyzer. Note the difference in case for the metaphone term: the two analyzers apply my_metaphone and lowercase in a different order. Thus, the metaphone term does not match.
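To see why filter order produces that case difference, here is a minimal Python sketch of the two chains. It is only a model, not Elasticsearch code: the stemmer and the other filters are left out, and the phonetic code is a hard-coded stand-in (`STAND_IN_CODES` is a hypothetical lookup table, not a real Metaphone encoder).

```python
# Stand-in phonetic codes. A real Metaphone encoder computes these; the value
# is hard-coded here (an assumption) only to keep the sketch self-contained.
STAND_IN_CODES = {"heaney": "HN"}


def lowercase(tokens):
    """Mimic the lowercase token filter."""
    return [t.lower() for t in tokens]


def phonetic_keep_original(tokens):
    """Mimic a phonetic filter with replace=false: keep each token, add its code."""
    out = []
    for t in tokens:
        out.append(t)
        code = STAND_IN_CODES.get(t.lower())
        if code is not None:
            out.append(code)
    return out


def run_chain(text, filters):
    """Split on whitespace, then apply each filter in order."""
    tokens = text.split()
    for f in filters:
        tokens = f(tokens)
    return tokens


# index_analyzer order: lowercase runs before the phonetic filter.
index_tokens = run_chain("Heaney", [lowercase, phonetic_keep_original])
# search_analyzer order: the phonetic filter runs before lowercase.
search_tokens = run_chain("Heaney", [phonetic_keep_original, lowercase])

print(index_tokens)   # ['heaney', 'HN']
print(search_tokens)  # ['heaney', 'hn']
print(set(index_tokens) & set(search_tokens))  # only the plain term overlaps
```

In this model the upper-case code survives at index time but is lowercased at search time, so the phonetic terms never meet. Using the same filter order in both analyzers would avoid that.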

The fuzzy query does not do query-time text analysis. Thus, you end up comparing Heavey with heanei. Their Damerau-Levenshtein distance is greater than what your parameters allow.
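The distance can be checked by hand or with a few lines of Python. The sketch below implements the optimal string alignment variant of Damerau-Levenshtein (insert, delete, substitute, adjacent transposition). The function name `osa_distance` is made up for this illustration, and Lucene's own implementation may differ in detail.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b (case-sensitive)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ab" -> "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(osa_distance("Heavey", "heanei"))  # 3: H/h, v/n and y/i all differ
print(osa_distance("heavey", "heaney"))  # 1: only v/n differs
```

The raw text against the analyzed term is three edits apart, while the two lowercased surnames are a single edit apart, which is why analyzing the query text first matters.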

What you really want is the fuzzy functionality of match. Match does do query-time text analysis, and it has a fuzziness parameter.
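As a rough illustration, a match query with fuzziness could be built like this. The helper `fuzzy_match_query` is hypothetical, and the field name is taken from the question's mapping. The sketch only prints the request body, which would then be sent to the `_search` endpoint.

```python
import json


def fuzzy_match_query(field, text, fuzziness):
    """Build a search body for a match query with a fuzziness parameter."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,           # analyzed at query time
                    "fuzziness": fuzziness,  # allowed edit distance
                }
            }
        }
    }


body = fuzzy_match_query("pty_surname", "Heaney", 1)
print(json.dumps(body, indent=2))
```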

As for the fuzziness, this changed a bit in Lucene 4. Before, it was typically specified as a float. Now it should be specified as the allowed edit distance. There's an outstanding pull request to clarify that: https://github.com/elasticsearch/elasticsearch/pull/4332/files

The reason you are getting people without the forename Michael is that you are using bool.should, which has OR semantics. It is sufficient that one clause matches, but the more clauses that match, the better the score.
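If the forename should be required rather than optional, one possible arrangement is to move it into a must clause and keep the exact surname as a should clause for scoring. This is a sketch under that assumption, not necessarily what the existing software does. The helper `name_query` is hypothetical, and the field names come from the question's mapping.

```python
import json


def name_query(forename, surname):
    """Require the forename and a near-match surname; boost exact surnames."""
    return {
        "query": {
            "bool": {
                # Every must clause has to match (AND semantics).
                "must": [
                    {"match": {"pty_forename": {"query": forename}}},
                    {"match": {"pty_surname": {"query": surname, "fuzziness": 1}}},
                ],
                # should clauses are optional here and only raise the score.
                "should": [
                    {"match": {"pty_surname": {"query": surname}}},
                ],
            }
        }
    }


body = name_query("Michael", "Heaney")
print(json.dumps(body, indent=2))
```

With this shape, a Michael Heavey can still match through the fuzzy surname clause, while a Michael Heaney scores higher because the exact clause matches too.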

Lastly, combining all that filtering into the same term is not necessarily the best approach. For example, you cannot identify and boost exact spellings. What you should consider is using a multi_field to process the field in several ways.

Here's an example you can play with, with the curl commands to recreate it below. I'd skip the "porter" stemmer entirely for this, however; I kept it just to show how multi_field works. Using a combination of match, match with fuzziness, and phonetic matching should get you far. (Make sure you don't allow fuzziness when you do phonetic matching, or you'll get uselessly fuzzy matching. :-)

#!/bin/bash

export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# Create indexes

curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
    "settings": {
        "analysis": {
            "text": [
                "Michael",
                "Heaney",
                "Heavey"
            ],
            "analyzer": {
                "metaphone": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "my_metaphone"
                    ]
                },
                "porter": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "porter_stem"
                    ]
                }
            },
            "filter": {
                "my_metaphone": {
                    "encoder": "metaphone",
                    "replace": false,
                    "type": "phonetic"
                }
            }
        }
    },
    "mappings": {
        "jr": {
            "properties": {
                "pty_surname": {
                    "type": "multi_field",
                    "fields": {
                        "pty_surname": {
                            "type": "string",
                            "analyzer": "simple"
                        },
                        "metaphone": {
                            "type": "string",
                            "analyzer": "metaphone"
                        },
                        "porter": {
                            "type": "string",
                            "analyzer": "porter"
                        }
                    }
                }
            }
        }
    }
}'


# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"jr"}}
{"pty_surname":"Heaney"}
{"index":{"_index":"play","_type":"jr"}}
{"pty_surname":"Heavey"}
'

# Do searches

curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "query": {
        "bool": {
            "should": [
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "pty_surname": {
                                        "query": "heavey"
                                    }
                                }
                            },
                            {
                                "match": {
                                    "pty_surname": {
                                        "query": "heavey",
                                        "fuzziness": 1
                                    }
                                }
                            },
                            {
                                "match": {
                                    "pty_surname.metaphone": {
                                        "query": "heavey"
                                    }
                                }
                            },
                            {
                                "match": {
                                    "pty_surname.porter": {
                                        "query": "heavey"
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}
'
