ElasticSearch - Searching For Human Names


Question

I have a large database of names, primarily from Scotland. We're currently producing a prototype to replace an existing piece of software which carries out the search. That software is still in production, and we're aiming to get our results as close as possible to its current results for the same search.

I was hoping someone could help me out. I am running a search in Elastic Search for "Michael Heaney" and I get some wild results. The current search returns two main surnames, "Heaney" and "Heavey", all with the forename "Michael". I can get the "Heaney" results in Elastic Search, but I can't obtain "Heavey". ES also returns people without the forename "Michael", although I appreciate that this is due to it being part of the fuzzy query. I know this is a narrow use case, as it's only one search, but getting this result and knowing how to obtain it will help.

Thanks.

Mapping

{
   "jr": {
    "_all": {
        "enabled": true,
        "index_analyzer": "index_analyzer",
        "search_analyzer": "search_analyzer"
    },
    "properties": {
        "pty_forename": {
            "type": "string",
            "index": "analyzed",
            "boost": 2,
            "index_analyzer": "index_analyzer",
            "search_analyzer": "search_analyzer",
            "store": "yes"
        },
        "pty_full_name": {
            "type": "string",
            "index": "analyzed",
            "boost": 4,
            "index_analyzer": "index_analyzer",
            "search_analyzer": "search_analyzer",
            "store": "yes"
        },
        "pty_surname": {
            "type": "string",
            "index": "analyzed",
            "boost": 4,
            "index_analyzer": "index_analyzer",
            "search_analyzer": "search_analyzer",
            "store": "yes"
        }
     }
   }
}

Index Settings

{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0,
    "analysis": {
        "analyzer": {
            "index_analyzer": {
                "tokenizer": "standard",
                "filter": [
                    "standard",
                    "my_delimiter",
                    "lowercase",
                    "stop",
                    "asciifolding",
                    "porter_stem",
                    "my_metaphone"
                ]
            },
            "search_analyzer": {
                "tokenizer": "standard",
                "filter": [
                    "standard",
                    "my_metaphone",
                    "synonym",
                    "lowercase",
                    "stop",
                    "asciifolding",
                    "porter_stem"
                ]
            }
        },
        "filter": {
            "synonym": {
                "type": "synonym",
                "synonyms_path": "synonyms/synonyms.txt"
            },
            "my_delimiter": {
                "type": "word_delimiter",
                "generate_word_parts": true,
                "catenate_words": false,
                "catenate_numbers": false,
                "catenate_all": false,
                "split_on_case_change": false,
                "preserve_original": false,
                "split_on_numerics": false,
                "stem_english_possessive": false
            },
            "my_metaphone": {
                "type": "phonetic",
                "encoder": "metaphone",
                "replace": false
            }
        }
     }
   }
}

Fuzzy

{
"from":0, "size":100,
"query": {
    "bool": {
        "should": [
            {
                "fuzzy": {
                    "pty_surname": {
                        "min_similarity": 0.2,
                        "value": "Heaney",
                        "prefix_length": 0,
                        "boost": 5
                    }
                }
            },
            {
                "fuzzy": {
                    "pty_forename": {
                        "min_similarity": 1,
                        "value": "Michael",
                        "prefix_length": 0,
                        "boost": 1
                    }
                }
            }
        ]
     }
  }
}

Solution

First, I recreated your current configuration in Play: https://www.found.no/play/gist/867785a709b4869c5543

If you go there, switch to the "Analysis"-tab to see how the text is transformed:

Note, for example, that Heaney ends up tokenized as [hn, heanei] by the search_analyzer and as [HN, heanei] by the index_analyzer. The metaphone terms differ in case, so that term does not match.
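
The case difference most likely comes from filter order: in the index_analyzer, my_metaphone runs last, while in the search_analyzer it runs before lowercase, so its code gets lowercased afterwards. The sketch below illustrates this with toy filters. Note that `toy_phonetic` is a hypothetical stand-in that just keeps uppercase consonants, not the real metaphone encoder, and the stemming step is left out.

```python
def lowercase(tokens):
    """Toy equivalent of the lowercase token filter."""
    return [t.lower() for t in tokens]


def toy_phonetic(tokens):
    """Hypothetical stand-in for a phonetic filter with replace=false:
    emits an UPPERCASE consonant skeleton next to each original token.
    This is NOT the real metaphone algorithm."""
    out = []
    for t in tokens:
        code = "".join(c for c in t.upper() if c.isalpha() and c not in "AEIOUY")
        out.extend([code, t])
    return out


def analyze(tokens, filters):
    """Apply the filters in order, as an analyzer's filter chain does."""
    for f in filters:
        tokens = f(tokens)
    return tokens


# Index side: lowercase first, phonetic last, so the code stays uppercase.
index_terms = analyze(["Heaney"], [lowercase, toy_phonetic])
# Search side: phonetic first, lowercase later, so the code is lowercased.
search_terms = analyze(["Heaney"], [toy_phonetic, lowercase])

print(index_terms)
print(search_terms)
```

Putting the phonetic filter in the same position in both chains (or using one analyzer for both) would make the two codes agree.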

The fuzzy query does not do query-time text analysis. Thus, you end up comparing the raw value Heavey with the indexed term heanei. Their Damerau-Levenshtein distance is greater than what your parameters allow.
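
To make the distance argument concrete, here is a minimal sketch that computes the optimal string alignment distance (a restricted form of Damerau-Levenshtein) in plain Python. It is my own illustration, not what Lucene runs internally, and the helper name `osa_distance` is made up for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions and adjacent transpositions each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ab" -> "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Raw, unanalyzed text against an analyzed (lowercased, stemmed) term.
raw_vs_indexed = osa_distance("Heavey", "heanei")
# Both sides lowercased and unstemmed.
clean_vs_clean = osa_distance("heavey", "heaney")

print(raw_vs_indexed, clean_vs_clean)
```

The first pair differs in three positions (H/h, v/n, y/i), while the lowercased, unstemmed names differ in only one. As far as I know, Lucene 4 supports at most two edits for fuzzy matching, so the first comparison cannot succeed.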

What you really want to do is use the fuzzy functionality of match. Match does do query-time text analysis, and it has a fuzziness parameter.

As for the fuzziness, this changed a bit in Lucene 4. Before, it was typically specified as a float. Now it should be specified as the allowed distance. There's an outstanding pull request to clarify that: https://github.com/elasticsearch/elasticsearch/pull/4332/files
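
As a rough sketch, the surname clause from the question could be rewritten as a match query with an integer fuzziness. The snippet only builds and prints the JSON body; the helper name `fuzzy_match` is hypothetical, and the distance of 1 is simply the value used in the curl example further down.

```python
import json


def fuzzy_match(field, text, max_edits):
    """Build a match clause whose fuzziness is an allowed edit distance
    (an integer), rather than a float similarity."""
    return {"match": {field: {"query": text, "fuzziness": max_edits}}}


query = {"query": fuzzy_match("pty_surname", "Heaney", 1)}
print(json.dumps(query, indent=2))
```

Because match analyzes its input, "Heaney" goes through the field's search analyzer before the fuzzy comparison is made.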

The reason why you are getting people without the forename Michael is that you are doing a bool.should. This has OR semantics: it is sufficient that one clause matches, although a document scores higher the more clauses it matches.
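
If the forename should be required rather than optional, one option is to move it into a must clause and keep the surname variants in should. This is a sketch under my own assumptions rather than part of the configuration above: the helper `name_query` is hypothetical, and the exact spelling of the `minimum_should_match` option may differ between Elasticsearch versions.

```python
import json


def name_query(forename, surname):
    """Require the forename (AND) while accepting either an exact or a
    fuzzy surname match (OR). Field names follow the question's mapping."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"pty_forename": forename}},
                ],
                "should": [
                    {"match": {"pty_surname": surname}},
                    {"match": {"pty_surname": {"query": surname, "fuzziness": 1}}},
                ],
                # Assumption: at least one surname clause must match too.
                "minimum_should_match": 1,
            }
        }
    }


body = name_query("Michael", "Heaney")
print(json.dumps(body, indent=2))
```

An exact surname hit satisfies both should clauses and therefore scores higher than a fuzzy-only hit.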

Lastly, combining all that filtering into the same term is not necessarily the best approach. For example, you can no longer tell exact spellings apart and boost them. What you should consider is using a multi_field to process the field in several different ways.

Here's an example you can play with, with the curl commands to recreate it below. I'd skip the "porter" stemmer entirely for this, however; I kept it just to show how multi_field works. Using a combination of match, match with fuzziness, and phonetic matching should get you far. (Make sure you don't allow fuzziness when you do phonetic matching, or you'll get uselessly fuzzy matching. :-)

#!/bin/bash

export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# Create indexes

curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
    "settings": {
        "analysis": {
            "text": [
                "Michael",
                "Heaney",
                "Heavey"
            ],
            "analyzer": {
                "metaphone": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "my_metaphone"
                    ]
                },
                "porter": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "porter_stem"
                    ]
                }
            },
            "filter": {
                "my_metaphone": {
                    "encoder": "metaphone",
                    "replace": false,
                    "type": "phonetic"
                }
            }
        }
    },
    "mappings": {
        "jr": {
            "properties": {
                "pty_surname": {
                    "type": "multi_field",
                    "fields": {
                        "pty_surname": {
                            "type": "string",
                            "analyzer": "simple"
                        },
                        "metaphone": {
                            "type": "string",
                            "analyzer": "metaphone"
                        },
                        "porter": {
                            "type": "string",
                            "analyzer": "porter"
                        }
                    }
                }
            }
        }
    }
}'


# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"jr"}}
{"pty_surname":"Heaney"}
{"index":{"_index":"play","_type":"jr"}}
{"pty_surname":"Heavey"}
'

# Do searches

curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "query": {
        "bool": {
            "should": [
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "pty_surname": {
                                        "query": "heavey"
                                    }
                                }
                            },
                            {
                                "match": {
                                    "pty_surname": {
                                        "query": "heavey",
                                        "fuzziness": 1
                                    }
                                }
                            },
                            {
                                "match": {
                                    "pty_surname.metaphone": {
                                        "query": "heavey"
                                    }
                                }
                            },
                            {
                                "match": {
                                    "pty_surname.porter": {
                                        "query": "heavey"
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}
'
