Assign a higher score to matches containing the search query at an earlier position in Elasticsearch

Problem description

This question is similar to my other question, which Val answered.

I have an index containing 3 documents.

    {
        "firstname": "Anne",
        "lastname": "Borg"
    }

    {
        "firstname": "Leanne",
        "lastname": "Ray"
    }

    {
        "firstname": "Anne",
        "middlename": "M",
        "lastname": "Stone"
    }

When I search for "Ann", I would like Elasticsearch to return all 3 of these documents (because they all match the term "Ann" to a degree). However, I would like Leanne Ray to have a lower score (relevance ranking), because the search term "Ann" appears at a later position in that document than it does in the other two.
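The desired ordering can be stated outside Elasticsearch as a small Python sketch. It only illustrates the requirement (rank by where the term first appears) and is not how Elasticsearch computes scores. The `first_match_position` helper and the `names` list are made up for illustration.

```python
# Illustration of the desired ranking only; Elasticsearch does not score this way.
def first_match_position(full_name: str, term: str) -> int:
    """Return the index of the first case-insensitive occurrence of term, or -1."""
    return full_name.lower().find(term.lower())


names = ["Anne Borg", "Leanne Ray", "Anne M Stone"]

# Keep only names that contain the term, earliest match first.
# sorted() is stable, so names with the same position keep their input order.
ranked = sorted(
    (n for n in names if first_match_position(n, "Ann") >= 0),
    key=lambda n: first_match_position(n, "Ann"),
)
print(ranked)
```

Both "Anne" documents match at position 0, while "Leanne Ray" matches later in the string, so it sorts last.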

Here are my index settings...

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "filter": [
                        "lowercase"
                    ],
                    "type": "custom",
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "token_chars": [
                        "letter",
                        "digit",
                        "custom"
                    ],
                    "custom_token_chars": "'-",
                    "min_gram": "1",
                    "type": "ngram",
                    "max_gram": "2"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "firstname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "lastname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "middlename": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "full_name": {
                "type": "text",
                "analyzer": "my_analyzer",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}
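To get a feel for what these settings index, here is a rough pure-Python approximation of the analyzer above (lowercase, then 1- and 2-character grams). It does not call Elasticsearch: `ngram_tokens` is a hypothetical helper, the regular expression only handles ASCII letters, digits, `'` and `-`, and the real tokenizer may differ in details. Because `full_name` has no separate `search_analyzer` in this mapping, the query text goes through the same analysis.

```python
import re


def ngram_tokens(text, min_gram=1, max_gram=2):
    """Approximate the custom ngram analyzer: lowercase, split, emit 1-2 char grams."""
    tokens = []
    # Characters outside letters, digits, ' and - act as separators (e.g. spaces).
    for word in re.findall(r"[A-Za-z0-9'\-]+", text.lower()):
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    tokens.append(word[start:start + size])
    return tokens


print(ngram_tokens("Ann"))
print(ngram_tokens("Leanne"))
```

Every gram produced for "Ann" also occurs among the grams for "Leanne", so with grams this short the match itself carries no information about where in the name the term appears.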

The following query brings back the expected documents, but attributes a higher score to Leanne Ray than to Anne Borg.

{
    "query": {
        "bool": {
            "must": {
                "query_string": {
                    "query": "Ann",
                    "fields": ["full_name"]
                }
            },
            "should": {
                "match": {
                    "full_name": "Ann"
                }
            }
        }
    }
}

Here are the results...

"hits": [
        {
            "_index": "contacts_4",
            "_type": "_doc",
            "_id": "2",
            "_score": 6.6333585,
            "_source": {
                "firstname": "Anne",
                "middlename": "M",
                "lastname": "Stone"
            }
        },
        {
            "_index": "contacts_4",
            "_type": "_doc",
            "_id": "1",
            "_score": 6.142234,
            "_source": {
                "firstname": "Leanne",
                "lastname": "Ray"
            }
        },
        {
            "_index": "contacts_4",
            "_type": "_doc",
            "_id": "3",
            "_score": 6.079495,
            "_source": {
                "firstname": "Anne",
                "lastname": "Borg"
            }
        }
]

Using an ngram token filter and an ngram tokenizer together seems to fix this problem...

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "filter": [
                        "ngram"
                    ],
                    "tokenizer": "ngram"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "firstname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "lastname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "middlename": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },
            "full_name": {
                "type": "text",
                "analyzer": "my_analyzer",
                "search_analyzer": "my_analyzer"
            }
        }
    }
}

The same query brings back the expected results with the desired relative scoring. Why does this work? Note that above, I am using an ngram tokenizer with a lowercase filter and the only difference here is that I am using an ngram filter instead of the lowercase filter.

Here are the results. Notice that Leanne Ray scored lower than both Anne Borg and Anne M Stone, as desired.

"hits": [
    {
        "_index": "contacts_4",
        "_type": "_doc",
        "_id": "3",
        "_score": 4.953257,
        "_source": {
            "firstname": "Anne",
            "lastname": "Borg"
        }
    },
    {
        "_index": "contacts_4",
        "_type": "_doc",
        "_id": "2",
        "_score": 4.87168,
        "_source": {
            "firstname": "Anne",
            "middlename": "M",
            "lastname": "Stone"
        }
    },
    {
        "_index": "contacts_4",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0364896,
        "_source": {
            "firstname": "Leanne",
            "lastname": "Ray"
        }
    }
]

By the way, this query also brings back a whole lot of false positive results when the index contains other documents as well. It's not such a problem because these false positives have very low scores relative to the scores of the desirable hits, but it is still not ideal. For example, if I add {firstname: Gideon, lastname: Grossma} to the index, the above query will bring back that document in the result set as well, albeit with a much lower score than the documents containing the string "Ann".
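The false positives follow from how short the grams are. The sketch below is an approximation in plain Python, not Elasticsearch itself: `grams` is a hypothetical helper that lowercases the text and collects 1- and 2-character grams per word, roughly like the first analyzer above. With an analyzer that does not lowercase, the exact overlap may differ.

```python
def grams(text, min_gram=1, max_gram=2):
    """Set of lowercase character n-grams for each whitespace-separated word."""
    out = set()
    for word in text.lower().split():
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    out.add(word[start:start + size])
    return out


query = grams("Ann")
for name in ["Anne Borg", "Leanne Ray", "Gideon Grossma"]:
    # Grams the name shares with the query; any overlap at all is enough to match.
    shared = sorted(query & grams(name))
    print(name, shared)
```

"Gideon Grossma" shares only single-character grams with the query, which is enough to match but contributes little to the score, consistent with the low-scoring hits described above.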

Recommended answer

The answer is the same as in the linked thread. Since you're ngramming all the indexed data, searching for Ann works the same way as searching for Anne. You'll get the exact same response (see below), though with different scores:

"hits" : [
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5Jr-DHIBhYuDqANwSeiw",
    "_score" : 4.8442974,
    "_source" : {
      "firstname" : "Anne",
      "lastname" : "Borg"
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5pr-DHIBhYuDqANwSeiw",
    "_score" : 4.828779,
    "_source" : {
      "firstname" : "Anne",
      "middlename" : "M",
      "lastname" : "Stone"
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5Zr-DHIBhYuDqANwSeiw",
    "_score" : 0.12874341,
    "_source" : {
      "firstname" : "Leanne",
      "lastname" : "Ray"
    }
  }
]

Update

Here is a modified query that you can use to check for parts (i.e. ann vs anne). Again, the casing makes no difference here, since the analyzer lowercases everything before indexing.

{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "ann",
          "fields": [
            "full_name"
          ]
        }
      },
      "should": [
        {
          "match_phrase_prefix": {
            "firstname": {
              "query": "ann",
              "boost": "10"
            }
          }
        },
        {
          "match_phrase_prefix": {
            "lastname": {
              "query": "ann",
              "boost": "10"
            }
          }
        }
      ]
    }
  }
}
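One way to read the `should` clauses: `firstname` and `lastname` have no analyzer set in the mapping, so they fall back to the default standard analyzer and are indexed as whole lowercase words rather than grams. A single-term `match_phrase_prefix` on such a field then behaves roughly like a prefix test on those words. The Python sketch below only imitates that behaviour for the single-term case; `standard_like_tokens` and `phrase_prefix_matches` are hypothetical helpers, not Elasticsearch APIs.

```python
import re


def standard_like_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def phrase_prefix_matches(field_value, query):
    """Single-term case only: true if any token in the field starts with the query term."""
    prefix = query.lower()
    return any(token.startswith(prefix) for token in standard_like_tokens(field_value))


for firstname in ["Anne", "Leanne"]:
    print(firstname, phrase_prefix_matches(firstname, "ann"))
```

Under this reading, only documents whose first or last name starts with "ann" receive the boost, while "Leanne" still matches through the `must` clause on `full_name` but gets no extra score.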
