Elasticsearch: Influence scoring with custom score field in document pt.2

Views: 80 | Published: 2020/10/28 2:16:13 | Tag: elasticsearch


Question

Given the following documents:

{
  "created_at" : "2017-07-31T20:30:14-04:00",
  "description" : null,
  "height" : 3213,
  "id" : "1",
  "tags" : [
    {
      "confidence" : 65.48948436785749,
      "tag" : "beach"
    },
    {
      "confidence" : 57.31950504425406,
      "tag" : "sea"
    },
    {
      "confidence" : 43.58207236617374,
      "tag" : "coast"
    },
    {
      "confidence" : 35.6857910950816,
      "tag" : "sand"
    },
    {
      "confidence" : 33.660057321079655,
      "tag" : "landscape"
    },
    {
      "confidence" : 32.53252312423727,
      "tag" : "sky"
    }
  ],
  "width" : 5712,
  "color" : "#0C0A07",
  "boost_multiplier" : 1
}

and

{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "2",
  "tags" : [
    {
      "confidence" : 84.09123410403951,
      "tag" : "mountain"
    },
    {
      "confidence" : 56.412795342449456,
      "tag" : "valley"
    },
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    {
      "confidence" : 40.51100450186575,
      "tag" : "mountains"
    },
    {
      "confidence" : 33.14263528292239,
      "tag" : "sky"
    },
    {
      "confidence" : 31.064394646169404,
      "tag" : "peak"
    },
    {
      "confidence" : 29.372,
      "tag" : "natural elevation"
    }
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 1
}

I want the _score to be calculated from the confidence value of each matching tag. For example, if you search "mountain", it should obviously return only the doc with id 2; if you search "landscape", the score of 2 should be higher than 1, as the confidence of landscape in 2 is higher than in 1 (48.36 vs 33.66). If you search for "coast landscape", this time the score of 1 should be higher than 2, because doc 1 has both coast and landscape in its tags array. I also want to multiply the score by "boost_multiplier" to boost some documents against others.
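
The expected ordering can be sketched in Python. This is a simplified model and an assumption, not Elasticsearch's actual scoring formula: it takes the score to be boost_multiplier times the sum of the confidences of the tags that match a search term. The names DOCS, expected_score and rank are hypothetical, and the confidences are rounded from the two documents above.

```python
# Simplified model of the desired ranking (assumption, not Elasticsearch's formula):
# score = boost_multiplier * sum(confidence of every tag that equals a search term)

DOCS = {
    "1": {
        "boost_multiplier": 1,
        "tags": {"beach": 65.49, "sea": 57.32, "coast": 43.58,
                 "sand": 35.69, "landscape": 33.66, "sky": 32.53},
    },
    "2": {
        "boost_multiplier": 1,
        "tags": {"mountain": 84.09, "valley": 56.41, "landscape": 48.37,
                 "mountains": 40.51, "sky": 33.14, "peak": 31.06,
                 "natural elevation": 29.37},
    },
}


def expected_score(doc, query):
    """Sum the confidences of tags that equal one of the whitespace-separated terms."""
    terms = set(query.lower().split())
    matched = [conf for tag, conf in doc["tags"].items() if tag in terms]
    return doc["boost_multiplier"] * sum(matched)


def rank(query):
    """Return (doc_id, score) pairs for matching docs, highest score first."""
    scored = [(doc_id, expected_score(doc, query)) for doc_id, doc in DOCS.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


for q in ("mountain", "landscape", "coast landscape"):
    print(q, "->", [doc_id for doc_id, _ in rank(q)])
```

In this model "mountain" matches only doc 2, "landscape" ranks doc 2 above doc 1, and "coast landscape" ranks doc 1 above doc 2, which is the behaviour asked for.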

I found this question on SO: Elasticsearch: Influence scoring with custom score field in document

But when I tried the accepted solution (I enabled scripting on my ES server), it returned both documents with a _score of 1.0, regardless of the search term. Here is the query I tried:

{
  "query": {
    "nested": {
      "path": "tags",
      "score_mode": "sum",
      "query": {
        "function_score": {
          "query": {
            "match": {
              "tags.tag": "coast landscape"
            }
          },
          "script_score": {
            "script": "doc[\"confidence\"].value"
          }
        }
      }
    }
  }
}

I also tried what @yahermann suggested in the comments, replacing "script_score" with "field_value_factor" : { "field" : "confidence" }, with the same result. Any idea why it fails, or is there a better way to do it?

For completeness, here is the mapping definition I used:

{
  "mappings": {
    "photo": {
      "properties": {
        "created_at": {
          "type": "date"
        },
        "description": {
          "type": "text"
        },
        "height": {
          "type": "short"
        },
        "id": {
          "type": "keyword"
        },
        "tags": {
          "type": "nested",
          "properties": {
            "tag": { "type": "string" },
            "confidence": { "type": "float"}
          }
        },
        "width": {
          "type": "short"
        },
        "color": {
          "type": "string"
        },
        "boost_multiplier": {
          "type": "float"
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1
  }
}

UPDATE
Following the answer of @Joanna below, I tried the query, but whatever I put in the match query (coast, foo, bar), it always returns both documents with a _score of 1.0. I tried it on Elasticsearch 2.4.6, 5.3 and 5.5.1 in Docker. Here is the response I get:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 1635

{"took":24,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"my_index","_type":"my_type","_id":"2","_score":1.0,"_source":{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "2",
  "tags" : [
    {
      "confidence" : 84.09123410403951,
      "tag" : "mountain"
    },
    {
      "confidence" : 56.412795342449456,
      "tag" : "valley"
    },
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    {
      "confidence" : 40.51100450186575,
      "tag" : "mountains"
    },
    {
      "confidence" : 33.14263528292239,
      "tag" : "sky"
    },
    {
      "confidence" : 31.064394646169404,
      "tag" : "peak"
    },
    {
      "confidence" : 29.372,
      "tag" : "natural elevation"
    }
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 1
}
},{"_index":"my_index","_type":"my_type","_id":"1","_score":1.0,"_source":{
  "created_at" : "2017-07-31T20:30:14-04:00",
  "description" : null,
  "height" : 3213,
  "id" : "1",
  "tags" : [
    {
      "confidence" : 65.48948436785749,
      "tag" : "beach"
    },
    {
      "confidence" : 57.31950504425406,
      "tag" : "sea"
    },
    {
      "confidence" : 43.58207236617374,
      "tag" : "coast"
    },
    {
      "confidence" : 35.6857910950816,
      "tag" : "sand"
    },
    {
      "confidence" : 33.660057321079655,
      "tag" : "landscape"
    },
    {
      "confidence" : 32.53252312423727,
      "tag" : "sky"
    }
  ],
  "width" : 5712,
  "color" : "#0C0A07",
  "boost_multiplier" : 1
}
}]}}

UPDATE-2
I found this on SO: Elasticsearch: function_score with "boost_mode": "replace" ignores function score

It basically says that if the function doesn't match, it returns 1. That makes sense, but I'm running the query against the same docs. That's confusing.

FINAL UPDATE
Finally I found the problem, stupid me. ES 101: if you send a GET request to the search API, it returns all documents with score 1.0 :) You should send a POST request... Thanks a lot @Joanna, it works perfectly!!!
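
A plausible reading of this fix (an assumption, since the post does not say which HTTP client was used) is that the client dropped the request body on GET, so Elasticsearch received an empty search and fell back to matching every document with score 1.0. The Python sketch below builds the request explicitly as a POST with a JSON body. The URL is hypothetical (a local node, with the index name "my_index" taken from the response above), and the request is only constructed, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint: a local node and the index name seen in the response above.
SEARCH_URL = "http://localhost:9200/my_index/_search"


def build_search_request(query_body):
    """Build (but do not send) a POST request that carries the query as a JSON body."""
    data = json.dumps(query_body).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request({"query": {"match": {"tags.tag": "coast"}}})
print(request.get_method(), request.full_url)
# To run it against a live node: urllib.request.urlopen(request)
```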

Recommended answer

You may try this query. It combines scoring with both the confidence and boost_multiplier fields:

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "should": [
            {
              "nested": {
                "path": "tags",
                "score_mode": "sum",
                "query": {
                  "function_score": {
                    "query": {
                      "match": {
                        "tags.tag": "landscape"
                      }
                    },
                    "field_value_factor": {
                      "field": "tags.confidence",
                      "factor": 1,
                      "missing": 0
                    }
                  }
                }
              }
            }
          ]
        }
      },
      "field_value_factor": {
        "field": "boost_multiplier",
        "factor": 1,
        "missing": 0
      }
    }
  }
}
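
Since only the search terms change between the runs described here, the query above can be generated by a small helper. This is a minimal Python sketch that reproduces the same JSON structure; the function name build_tag_query and its parameter names are hypothetical, not part of any Elasticsearch client.

```python
import json


def build_tag_query(terms, confidence_factor=1, boost_factor=1):
    """Return the query above as a dict, with the search terms and factors filled in."""
    # Inner function_score: runs per nested tag, weighted by tags.confidence.
    per_tag = {
        "function_score": {
            "query": {"match": {"tags.tag": terms}},
            "field_value_factor": {
                "field": "tags.confidence",
                "factor": confidence_factor,
                "missing": 0,
            },
        }
    }
    # Nested query: sums the scores of all matching tags of one document.
    nested = {"nested": {"path": "tags", "score_mode": "sum", "query": per_tag}}
    # Outer function_score: weights the summed score by boost_multiplier.
    return {
        "query": {
            "function_score": {
                "query": {"bool": {"should": [nested]}},
                "field_value_factor": {
                    "field": "boost_multiplier",
                    "factor": boost_factor,
                    "missing": 0,
                },
            }
        }
    }


body = build_tag_query("coast landscape")
print(json.dumps(body, indent=2))
```

The resulting dict can be serialized with json.dumps and sent as the body of a search request.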

When I search with the term coast, it returns the document with id=1, as only this one has the term, and its score is "_score": 100.27469.

When I search with the term landscape, it returns two documents:

document with id=2 and "_score": 85.83046
document with id=1 and "_score": 59.7339

As the document with id=2 has a higher confidence value, it gets the higher score.

When I search with the terms coast landscape, it returns two documents:

document with id=1 and "_score": 160.00859
document with id=2 and "_score": 85.83046

Although the document with id=2 has a higher confidence value, the document with id=1 matches both words, so it gets a much higher score. By changing the value of the "factor": 1 parameter, you can decide how much confidence should influence the results.

A more interesting thing happens when I index a new document: let's say it is almost the same as the document with id=2, but I set "boost_multiplier" : 4 and "id": "3":
{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "3",
  "tags" : [
    ...
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    ...
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 4
}

Running the same query with the terms coast landscape returns three documents:

document with id=3 and "_score": 360.02664
document with id=1 and "_score": 182.09859
document with id=2 and "_score": 90.00666

Although the document with id=3 has only one matching word (landscape), its boost_multiplier value considerably increases the score. Here too, "factor": 1 lets you decide how much this value should increase the score, and "missing": 0 decides what happens if the field is not indexed.
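
The effect of "factor" and "missing" can be illustrated with a little arithmetic. The Python sketch below mimics field_value_factor without a modifier, combined with the query score by multiplication (the default boost_mode). The function names are hypothetical and the query score of 10.0 is a made-up number, not one of the scores reported above.

```python
def field_value_factor(value, factor=1.0, missing=None):
    """Mimic field_value_factor with no modifier: factor * field value.

    When the document has no value for the field, `missing` is used instead.
    """
    if value is None:
        if missing is None:
            raise ValueError("field is missing and no 'missing' default was given")
        value = missing
    return factor * value


def boosted_score(query_score, boost_multiplier, factor=1.0, missing=0.0):
    """Combine the query score with the function value by multiplication."""
    return query_score * field_value_factor(boost_multiplier, factor, missing)


print(boosted_score(10.0, 4))              # boost_multiplier 4 quadruples the score
print(boosted_score(10.0, 4, factor=0.5))  # a smaller factor dampens the boost
print(boosted_score(10.0, None))           # no boost_multiplier: "missing": 0 gives 0.0
```

One consequence worth noting: with "missing": 0, a document that has no boost_multiplier ends up with a score of 0 under this model, so a value of 1 may be a safer default if such documents should keep their original score.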
