elasticsearch-按百分比过滤 [英] elasticsearch - filter by percentile

查看：348 发布时间：2020/10/28 2:33:00 elasticsearch percentile

本文介绍了elasticsearch-按百分比过滤的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

说我是否要按第10到20个百分点内的某个字段过滤文档。我想知道是否可以通过一些简单的查询来完成，例如 { fieldName：{ percentile：[0.1，0.2]}} 。

Say if I want to filter documents by some field within 10th to 20th percentile. I'm wondering if it's possible by some simple query, something like {"fieldName":{"percentile": [0.1, 0.2]}}.

说我有这些文件：

[{"a":1,"b":101},{"a":2,"b":102},{"a":3,"b":103}, ..., {"a":100,"b":200}]

我需要用 a （升序），即 a 从1到10。然后我需要按 b 降序排列，然后进行分页结果（如第2页，每页10个项目）。


I need to filter the top 10th of them by a (with ascending order), that would be a  from 1 to 10. Then I need to sort those results by b with descending order, then take the paginated result (like page No.2, with 10 items every page).
一个解决方案是：
 
  获取文档总数。

get the total count of the documents.
将文档按 a 排序，取相应的 _id ，限制为 0.1 * total_count  
sort the documents by a, take the corresponding _id with limit 0.1 * total_count
写入最终查询，类似于 id由b  

write the final query, something like id in (...) order by b
但是缺点是
 
  如果我们谈论的是亚秒级延迟，似乎效率不高

seems not effecient if we're talking about subsecond latency
如果我们在第一个查询中返回的 _id 太多，则第二个查询可能无法工作（默认情况下，ES仅允许1000。我可以更改配置，但是总会有一个限制。 
the second query might not work if we have too many _id returned in the first query (ES only allows 1000 by default. I can change the config of course, but there's always a limit). 
 
 
推荐答案
我怀疑是否有一种方法可以在一个查询中执行此操作如果我不知道 a 的确切值是什么，尽管我认为一种非常有效的方法是可行的。
I doubt that there is a way to do this in one query if the exact values of a are not known beforehand, although I think one pretty efficient approach is feasible.
 I建议执行 百分位数聚合作为第一个查询，并 范围查询为第二个。
I would suggest to do a percentiles aggregation as first query and range query as second.
在我的样本索引中，我只有14个文档，因此出于解释的原因，我将尝试查找占 a 字段30％到60％的那些文档，并按照字段 b 的顺序相反（因此请确保排序有效）。 
In my sample index I have only 14 documents, so for explanatory reasons I will try to find those documents that are from 30% to 60% of field a and sort them by field b in inverse order (so to be sure that sort worked). 
以下是我插入的文档：
{"a":1,"b":101}
{"a":5,"b":105}
{"a":10,"b":110}
{"a":2,"b":102}
{"a":6,"b":106}
{"a":7,"b":107}
{"a":9,"b":109}
{"a":4,"b":104}
{"a":8,"b":108}
{"a":12,"b":256}
{"a":13,"b":230}
{"a":14,"b":215}
{"a":3,"b":103}
{"a":11,"b":205}

让我们找出字段 a 在30％到60％百分位数之间的界限：
Let's find out which are the bounds for field a between 30% and 60% percentiles:
POST my_percent/doc/_search
{
    "size": 0,
    "aggs" : {
        "percentiles" : {
            "percentiles" : {
                "field" : "a",
                "percents": [ 30, 60, 90 ]
            }
        }
    }
}

对于我的样本索引，它看起来像这样：
With my sample index it looks like this:
{
...
  "hits": {
    "total": 14,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "percentiles": {
      "values": {
        "30.0": 4.9,
        "60.0": 8.8,
        "90.0": 12.700000000000001
      }
    }
  }
}

现在我们可以使用边界进行范围查询：
Now we can use the boundaries to do the range query:
POST my_percent/doc/_search
{
    "query": {
      "range": {
            "a" : {
                "gte" : 4.9,
                "lte" : 8.8
            }
        }
    },
    "sort": {
      "b": "desc"
    }
}

结果为：
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": null,
    "hits": [
      {
        "_index": "my_percent",
        "_type": "doc",
        "_id": "vkFvYGMB_zM1P5OLcYkS",
        "_score": null,
        "_source": {
          "a": 8,
          "b": 108
        },
        "sort": [
          108
        ]
      },
      {
        "_index": "my_percent",
        "_type": "doc",
        "_id": "vUFvYGMB_zM1P5OLWYkM",
        "_score": null,
        "_source": {
          "a": 7,
          "b": 107
        },
        "sort": [
          107
        ]
      },
      {
        "_index": "my_percent",
        "_type": "doc",
        "_id": "vEFvYGMB_zM1P5OLRok1",
        "_score": null,
        "_source": {
          "a": 6,
          "b": 106
        },
        "sort": [
          106
        ]
      },
      {
        "_index": "my_percent",
        "_type": "doc",
        "_id": "u0FvYGMB_zM1P5OLJImy",
        "_score": null,
        "_source": {
          "a": 5,
          "b": 105
        },
        "sort": [
          105
        ]
      }
    ]
  }
}

请注意，百分位数聚合的结果是近似值。
Note that the results of percentiles aggregation are approximate.
通常，通过 pandas 或火花工作。
希望有帮助！

                        这篇关于elasticsearch-按百分比过滤的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

elasticsearch-按百分比过滤 [英] elasticsearch - filter by percentile

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

elasticsearch-按百分比过滤 [英] elasticsearch - filter by percentile

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭