How to find if a document is a good match for a query, e.g., normalize Elasticsearch score?


Problem description

The score computed by Elasticsearch provides a ranking between the documents, but it does not tell whether the documents are a good match for the request. Currently, the first document may match on all fields or on just one. The only information that the score provides is that it is the best match.

Would it be possible to get a normalized score with respect to the query? For example, a score of 1 would be a document that matches the query perfectly and a score of 0.1 a document that matches it poorly.

Solution

In short, no, it is not possible to get a real normalized score for a query, but it is possible to get a good enough score normalization that works in many cases.

The difficulty in getting a score that tells whether a document is a good match for a query is that you first need to know what the best possible document for that query would be, and consequently the maximum score. With Elasticsearch, and with most (if not all) scoring metrics, the maximum score is not bounded.

Even with a simple match query, you can technically reach an infinite score with a document that repeats the queried term an infinite number of times. Without a bound on the score, it is not possible to get a true normalized score.

But all hope is not lost. Instead of normalizing against the best possible score, you can normalize against a fake ideal document that is supposed to get the maximum score. For example, if you are querying two fields, name and occupation, with the queried terms Jane Doe and Cook, your ideal document can be:

{
    "name": "Jane Doe",
    "occupation": "Cook"
}

If the index contains a document whose name is, for example, Jane Jane Doe, then the ideal document may not get the maximum score. If the queried fields are relatively short, you probably do not have to worry about term duplication. If you have fields with many terms, you may decide to repeat some of the frequent terms in the ideal document. If the objective is only to find whether a document is a good match or not, it is usually not a problem for a document to score higher than the ideal document.

The good news is that if you are using Elasticsearch 6.4 or later, you do not have to index the fake document to get its score for a query. You can use the endpoint _scripts/painless/_execute to obtain the score of the ideal document.
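The normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name normalize_score, the clamp option, and the sample scores are all hypothetical and not part of the original answer. Clamping at 1.0 is one possible way to handle documents that outscore the ideal document.

```python
def normalize_score(score: float, ideal_score: float, clamp: bool = True) -> float:
    """Scale a hit's score by the score of the fake ideal document.

    With clamp=True, hits that outscore the ideal document are capped at 1.0,
    since the ideal document is only an approximation of the maximum score.
    """
    if ideal_score <= 0:
        raise ValueError("ideal_score must be positive")
    ratio = score / ideal_score
    return min(ratio, 1.0) if clamp else ratio


# Made-up (document id, _score) pairs and a made-up ideal score, for illustration only.
hits = [("doc-1", 4.2), ("doc-2", 2.1), ("doc-3", 5.0)]
ideal_score = 4.2

for doc_id, score in hits:
    print(doc_id, round(normalize_score(score, ideal_score), 2))
```

A threshold on the normalized value (for example, treating anything above some cutoff as a good match) would then be a choice specific to the application.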

GET _scripts/painless/_execute
{
    "script": {
        "source": "_score"
    },
    "context": "score",
    "context_setup": {
        "index": <INDEX>,
        "document": <THE_IDEAL_DOCUMENT>,
        "query": <YOUR_QUERY>
    }
}
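As a rough sketch, the request body above could be assembled programmatically like this. The helper name build_ideal_score_request, the index name people, and the multi_match query are assumptions made for illustration; the original answer leaves the index, document, and query as placeholders. The code only builds and prints the JSON body; sending it to the endpoint is left to whatever HTTP client you already use.

```python
import json


def build_ideal_score_request(index: str, ideal_document: dict, query: dict) -> dict:
    """Build the body for _scripts/painless/_execute in the score context."""
    return {
        "script": {"source": "_score"},  # the script simply returns the query score
        "context": "score",
        "context_setup": {
            "index": index,
            "document": ideal_document,
            "query": query,
        },
    }


# Hypothetical values: the ideal document from the example above, an assumed
# index name, and an assumed multi_match query over the two fields.
ideal_document = {"name": "Jane Doe", "occupation": "Cook"}
query = {"multi_match": {"query": "Jane Doe Cook", "fields": ["name", "occupation"]}}

body = build_ideal_score_request("people", ideal_document, query)
print(json.dumps(body, indent=4))
```

The score returned for this request can then serve as the denominator when normalizing the scores of real search hits for the same query.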

Please note that field statistics, such as the number of documents containing a field and the number of fields containing the queried term, are taken into account when computing the score of the fake document. If you have many documents, this should not be a problem, but for a very infrequent field or term (say, below 20) you may notice a lower score for the ideal document compared to a previously indexed document.
