Elasticsearch: Influence scoring with custom score field in document pt.2


Question

Given the following documents:

{
  "created_at" : "2017-07-31T20:30:14-04:00",
  "description" : null,
  "height" : 3213,
  "id" : "1",
  "tags" : [
    {
      "confidence" : 65.48948436785749,
      "tag" : "beach"
    },
    {
      "confidence" : 57.31950504425406,
      "tag" : "sea"
    },
    {
      "confidence" : 43.58207236617374,
      "tag" : "coast"
    },
    {
      "confidence" : 35.6857910950816,
      "tag" : "sand"
    },
    {
      "confidence" : 33.660057321079655,
      "tag" : "landscape"
    },
    {
      "confidence" : 32.53252312423727,
      "tag" : "sky"
    }
  ],
  "width" : 5712,
  "color" : "#0C0A07",
  "boost_multiplier" : 1
}

{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "2",
  "tags" : [
    {
      "confidence" : 84.09123410403951,
      "tag" : "mountain"
    },
    {
      "confidence" : 56.412795342449456,
      "tag" : "valley"
    },
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    {
      "confidence" : 40.51100450186575,
      "tag" : "mountains"
    },
    {
      "confidence" : 33.14263528292239,
      "tag" : "sky"
    },
    {
      "confidence" : 31.064394646169404,
      "tag" : "peak"
    },
    {
      "confidence" : 29.372,
      "tag" : "natural elevation"
    }
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 1
}

I want the _score to be calculated from the confidence values of each tag. For example, if you search for "mountain", it should return only the document with id 2; if you search for "landscape", the score of 2 should be higher than that of 1, as the confidence of landscape in 2 is higher than in 1 (48.36 vs 33.66). If you search for "coast landscape", this time the score of 1 should be higher than that of 2, because doc 1 has both coast and landscape in its tags array. I also want to multiply the score by "boost_multiplier" to boost some documents against others.
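The ranking described above can be sketched outside Elasticsearch as a plain function. This is only an illustration of the intended behaviour, not how Elasticsearch computes relevance: the names `intended_score` and `rank` are hypothetical, the documents are trimmed to the relevant tags, and text relevance (TF/IDF or BM25) is ignored entirely.

```python
from typing import Dict, List


def intended_score(doc: Dict, search_terms: List[str]) -> float:
    """Sum the confidence of every tag matching a search term,
    then multiply by the document's boost_multiplier."""
    matched = sum(
        tag["confidence"] for tag in doc["tags"] if tag["tag"] in search_terms
    )
    return matched * doc.get("boost_multiplier", 1)


def rank(docs: List[Dict], search_terms: List[str]) -> List[str]:
    """Return ids of matching documents, best score first."""
    scored = [(doc["id"], intended_score(doc, search_terms)) for doc in docs]
    matching = [pair for pair in scored if pair[1] > 0]
    return [doc_id for doc_id, _ in sorted(matching, key=lambda pair: -pair[1])]


# Trimmed copies of the two documents (only the tags used in the examples).
doc1 = {"id": "1", "boost_multiplier": 1, "tags": [
    {"tag": "coast", "confidence": 43.58207236617374},
    {"tag": "landscape", "confidence": 33.660057321079655},
]}
doc2 = {"id": "2", "boost_multiplier": 1, "tags": [
    {"tag": "mountain", "confidence": 84.09123410403951},
    {"tag": "landscape", "confidence": 48.36547551196872},
]}

print(rank([doc1, doc2], ["mountain"]))
print(rank([doc1, doc2], ["landscape"]))
print(rank([doc1, doc2], ["coast", "landscape"]))
```

With this model, "mountain" matches only doc 2, "landscape" ranks doc 2 first, and "coast landscape" ranks doc 1 first because two of its tags contribute.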

I found this question on SO: Elasticsearch: Influence scoring with custom score field in document

But when I tried the accepted solution (I enabled scripting on my ES server), it returned both documents with a _score of 1.0, regardless of the search term. Here is the query that I tried:

{
  "query": {
    "nested": {
      "path": "tags",
      "score_mode": "sum",
      "query": {
        "function_score": {
          "query": {
            "match": {
              "tags.tag": "coast landscape"
            }
          },
          "script_score": {
            "script": "doc[\"confidence\"].value"
          }
        }
      }
    }
  }
}

I also tried what @yahermann suggested in the comments, replacing "script_score" with "field_value_factor" : { "field" : "confidence" }, with the same result. Any idea why it fails, or is there a better way to do it?

For completeness, here is the mapping definition that I used:

{
  "mappings": {
    "photo": {
      "properties": {
        "created_at": {
          "type": "date"
        },
        "description": {
          "type": "text"
        },
        "height": {
          "type": "short"
        },
        "id": {
          "type": "keyword"
        },
        "tags": {
          "type": "nested",
          "properties": {
            "tag": { "type": "string" },
            "confidence": { "type": "float"}
          }
        },
        "width": {
          "type": "short"
        },
        "color": {
          "type": "string"
        },
        "boost_multiplier": {
          "type": "float"
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1
  }
}

UPDATE
Following the answer of @Joanna below, I tried the query, but whatever I put in the match query (coast, foo, bar), it always returns both documents with a _score of 1.0. I tried it on Elasticsearch 2.4.6, 5.3, and 5.5.1 in Docker. Here is the response I get:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 1635

{"took":24,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"my_index","_type":"my_type","_id":"2","_score":1.0,"_source":{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "2",
  "tags" : [
    {
      "confidence" : 84.09123410403951,
      "tag" : "mountain"
    },
    {
      "confidence" : 56.412795342449456,
      "tag" : "valley"
    },
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    {
      "confidence" : 40.51100450186575,
      "tag" : "mountains"
    },
    {
      "confidence" : 33.14263528292239,
      "tag" : "sky"
    },
    {
      "confidence" : 31.064394646169404,
      "tag" : "peak"
    },
    {
      "confidence" : 29.372,
      "tag" : "natural elevation"
    }
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 1
}
},{"_index":"my_index","_type":"my_type","_id":"1","_score":1.0,"_source":{
  "created_at" : "2017-07-31T20:30:14-04:00",
  "description" : null,
  "height" : 3213,
  "id" : "1",
  "tags" : [
    {
      "confidence" : 65.48948436785749,
      "tag" : "beach"
    },
    {
      "confidence" : 57.31950504425406,
      "tag" : "sea"
    },
    {
      "confidence" : 43.58207236617374,
      "tag" : "coast"
    },
    {
      "confidence" : 35.6857910950816,
      "tag" : "sand"
    },
    {
      "confidence" : 33.660057321079655,
      "tag" : "landscape"
    },
    {
      "confidence" : 32.53252312423727,
      "tag" : "sky"
    }
  ],
  "width" : 5712,
  "color" : "#0C0A07",
  "boost_multiplier" : 1
}
}]}}

UPDATE-2
I found this on SO: Elasticsearch: function_score with "boost_mode": "replace" ignores function score

It basically says that if the function doesn't match, it returns 1. That makes sense, but I'm running the query against the same docs. That's confusing.

FINAL UPDATE
Finally I found the problem, stupid me. ES 101: if you send a GET request to the search API, it returns all documents with score 1.0 :) You should send a POST request... Thanks a lot @Joanna, it works perfectly!!!
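A likely explanation for this symptom is that the HTTP client dropped the request body on GET, so Elasticsearch saw an empty search, which matches every document with a score of 1.0. The sketch below shows one way to make sure the query travels in a POST body using only the Python standard library. The URL (`http://localhost:9200/my_index/_search`) and the helper name `build_search_request` are assumptions for illustration; the index name is taken from the response shown earlier.

```python
import json
import urllib.request


def build_search_request(url: str, query: dict) -> urllib.request.Request:
    """Build a POST request that carries the query as a JSON body."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query = {
    "query": {
        "nested": {
            "path": "tags",
            "query": {"match": {"tags.tag": "coast"}},
        }
    }
}

# Assumed local endpoint; adjust host, port and index name as needed.
request = build_search_request("http://localhost:9200/my_index/_search", query)
print(request.get_method(), request.full_url)

# Sending is left out so the sketch runs without a server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["hits"]["max_score"])
```

If every hit still comes back with `_score` 1.0, checking that the body actually reached the server is a reasonable first step.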

Recommended answer

You may try this query. It combines scoring with both the confidence and boost_multiplier fields:

{
  "query": {
    "function_score": {
        "query": {
            "bool": {
                "should": [{
                    "nested": {
                      "path": "tags",
                      "score_mode": "sum",
                      "query": {
                        "function_score": {
                          "query": {
                            "match": {
                              "tags.tag": "landscape"
                            }
                          },
                          "field_value_factor": {
                            "field": "tags.confidence",
                            "factor": 1,
                            "missing": 0
                          }
                        }
                      }
                    }
                }]
            }
        },
        "field_value_factor": {
            "field": "boost_multiplier",
            "factor": 1,
            "missing": 0
        }
      }
    }
} 
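When the search terms or the factors need to vary, the same query can be generated from a small helper. This is a minimal sketch that mirrors the JSON above; the function name `build_tag_query` and its parameters `confidence_factor` and `boost_factor` are hypothetical and simply feed the two "factor" settings.

```python
import json


def build_tag_query(terms: str,
                    confidence_factor: float = 1,
                    boost_factor: float = 1) -> dict:
    """Build the nested function_score query for the given search terms."""
    # Inner part: score each matching nested tag by its confidence.
    inner = {
        "function_score": {
            "query": {"match": {"tags.tag": terms}},
            "field_value_factor": {
                "field": "tags.confidence",
                "factor": confidence_factor,
                "missing": 0,
            },
        }
    }
    # score_mode "sum" adds up the scores of all matching tags.
    nested = {"nested": {"path": "tags", "score_mode": "sum", "query": inner}}
    # Outer part: scale the summed score by the document's boost_multiplier.
    return {
        "query": {
            "function_score": {
                "query": {"bool": {"should": [nested]}},
                "field_value_factor": {
                    "field": "boost_multiplier",
                    "factor": boost_factor,
                    "missing": 0,
                },
            }
        }
    }


query = build_tag_query("coast landscape")
print(json.dumps(query, indent=2))
```

The resulting dictionary can be serialised with `json.dumps` and sent as the body of a search request.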

When I search with the term coast, it returns:

  • document with id=1, as only this one has this term, with "_score": 100.27469.

When I search with the term landscape, it returns two documents:

  • document with id=2 and "_score": 85.83046
  • document with id=1 and "_score": 59.7339

As the document with id=2 has a higher value in the confidence field, it gets a higher score.

When I search with the terms coast landscape, it returns two documents:

  • document with id=1 and "_score": 160.00859
  • document with id=2 and "_score": 85.83046

Although the document with id=2 has a higher confidence value for landscape, the document with id=1 matches both words, so it gets a much higher score. By changing the value of the "factor": 1 parameter, you can decide how much confidence should influence the results.

A more interesting thing happens when I index a new document: let's say it is almost the same as the document with id=2, but with "boost_multiplier" : 4 and "id": 3:

{
  "created_at" : "2017-07-31T20:43:17-04:00",
  "description" : null,
  "height" : 4934,
  "id" : "3",
  "tags" : [
    ...
    {
      "confidence" : 48.36547551196872,
      "tag" : "landscape"
    },
    ...
  ],
  "width" : 4016,
  "color" : "#FEEBF9",
  "boost_multiplier" : 4
}

Running the same query with the terms coast landscape returns three documents:

  • document with id=3 and "_score": 360.02664
  • document with id=1 and "_score": 182.09859
  • document with id=2 and "_score": 90.00666

Although the document with id=3 has only one matching word (landscape), its boost_multiplier value considerably increases its score. Here, too, "factor": 1 lets you decide how much this value should increase the score, and "missing": 0 decides what should happen if no such field is indexed.
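The scores reported above are consistent with simple arithmetic, assuming the defaults that the query leaves unset: no modifier on field_value_factor (so the function value is factor × field value) and a boost_mode of multiply on function_score. The sketch below reproduces that calculation in plain Python; the helper names are hypothetical, and real Elasticsearch scores are 32-bit floats, so the comparison is approximate.

```python
from typing import Optional


def field_value_factor(value: Optional[float],
                       factor: float = 1.0,
                       missing: float = 0.0) -> float:
    """factor * value with no modifier; `missing` stands in for an absent field."""
    if value is None:
        value = missing
    return factor * value


def boosted_score(nested_score: float, boost_multiplier: Optional[float]) -> float:
    """Combine the nested query score with the boost (boost_mode: multiply)."""
    return nested_score * field_value_factor(boost_multiplier)


# Documents 2 and 3 have identical tags, so they share the same nested score.
nested_score = 90.00666

doc2_score = boosted_score(nested_score, 1)     # boost_multiplier: 1
doc3_score = boosted_score(nested_score, 4)     # boost_multiplier: 4
no_boost = boosted_score(nested_score, None)    # field not indexed

print(doc2_score, doc3_score, no_boost)
```

Under these assumptions document 3 scores four times document 2 (90.00666 × 4 = 360.02664), and a document without a boost_multiplier field would score 0 because "missing": 0 is multiplied into the result. Setting "missing": 1 would leave such documents unboosted instead.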
