How to perform a pipeline aggregation without returning all buckets in Elasticsearch

Question

I'm using Elasticsearch 2.3 and I'm trying to perform a two-step computation using a pipeline aggregation. I'm only interested in the final result of my pipeline aggregation, but Elasticsearch returns the information for every bucket.

Since I have a huge number of buckets (tens or hundreds of millions), this is prohibitive. Unfortunately, I cannot find a way to tell Elasticsearch not to return all this information.

Here is a toy example. I have an index test-index with a document type obj. obj has two fields, key and value.

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 100,
  "key": "foo"
}'

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 20,
  "key": "foo"
}'

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 50,
  "key": "bar"
}'

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 60,
  "key": "bar"
}'

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 70,
  "key": "bar"
}'

I want to get the average (over all keys) of the minimum value of the objs sharing the same key: an average of minima.
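To make the target quantity concrete, here is a minimal sketch that recomputes it locally for the five toy documents above, without Elasticsearch. The function name avg_of_minima and the "key value" line format are assumptions made for this illustration only.

```shell
#!/bin/sh
# Hypothetical helper (not part of Elasticsearch): reads "key value" lines
# on stdin, keeps the minimum value seen per key, then prints the average
# of those per-key minima.
avg_of_minima() {
  awk '
    # First value for this key, or a new minimum for it.
    !($1 in min) || ($2 + 0) < min[$1] { min[$1] = $2 + 0 }
    END {
      for (k in min) { sum += min[k]; n++ }
      if (n > 0) print sum / n
    }
  '
}

# The five toy documents indexed above, one "key value" pair per line.
printf '%s\n' 'foo 100' 'foo 20' 'bar 50' 'bar 60' 'bar 70' | avg_of_minima
```

The minimum is 20 for foo and 50 for bar, so this prints 35, which is the number the pipeline aggregation is expected to produce.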

Elasticsearch allows me to do this:

curl -XPOST 'http://10.10.0.7:9200/test-index/obj/_search' -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "key_aggregates": {
      "terms": {
        "field": "key",
        "size": 0
      },
      "aggs": {
        "min_value": {
          "min": {
            "field": "value"
          }
        }
      }
    },
    "avg_min_value": {
      "avg_bucket": {
        "buckets_path": "key_aggregates>min_value"
      }
    }
  }
}'

But this query returns the minimum for every bucket, although I don't need it:

{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": [

    ]
  },
  "aggregations": {
    "key_aggregates": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "bar",
          "doc_count": 2,
          "min_value": {
            "value": 50
          }
        },
        {
          "key": "foo",
          "doc_count": 2,
          "min_value": {
            "value": 20
          }
        }
      ]
    },
    "avg_min_value": {
      "value": 35
    }
  }
}

Is there a way to get rid of all the information inside "buckets": [...]? I'm only interested in avg_min_value.

This might not seem like a problem in this toy example, but when the number of distinct keys is large (tens or hundreds of millions), the query response is prohibitively large, and I would like to prune it.

Is there a way to do this with Elasticsearch? Or am I modelling my data wrong?

NB: it is not acceptable to pre-aggregate my data per key, since the match_all part of my query might be replaced by complex and unknown filters.

NB2: changing size to a non-zero number in my terms aggregation is not acceptable, because it would limit the number of buckets and therefore change the result.

Recommended answer

I had the same issue and after doing quite a bit of research I found a solution and thought I'd share here.

You can use the Response Filtering feature to filter the part of the answer that you want to receive.

You should be able to achieve what you want by adding the query parameter filter_path=aggregations.avg_min_value to the search URL. In the example case, it should look similar to this:

curl -XPOST 'http://10.10.0.7:9200/test-index/obj/_search?filter_path=aggregations.avg_min_value' -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "key_aggregates": {
      "terms": {
        "field": "key",
        "size": 0
      },
      "aggs": {
        "min_value": {
          "min": {
            "field": "value"
          }
        }
      }
    },
    "avg_min_value": {
      "avg_bucket": {
        "buckets_path": "key_aggregates>min_value"
      }
    }
  }
}'
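Since filter_path accepts a comma-separated list of dotted paths, the search URL can be assembled from whichever response fields you want to keep. The sketch below only builds and prints such a URL. The function name build_search_url is hypothetical, and the host and index are the ones from the example above.

```shell
#!/bin/sh
# Hypothetical helper: build a _search URL whose filter_path keeps only
# the listed response paths (joined with commas).
# Usage: build_search_url BASE_URL INDEX_PATH PATH [PATH...]
build_search_url() {
  base_url=$1
  index_path=$2
  shift 2
  # Inside the command substitution, "$*" joins the remaining arguments
  # using the first character of IFS, here a comma.
  paths=$(IFS=,; printf '%s' "$*")
  printf '%s/%s/_search?filter_path=%s\n' "$base_url" "$index_path" "$paths"
}

# Keep only the pipeline result and the query duration.
build_search_url 'http://10.10.0.7:9200' 'test-index/obj' \
  aggregations.avg_min_value took
```

The printed URL can be passed to the same curl -XPOST call with the same request body. Note that response filtering only trims what is sent back to the client: the terms buckets are still computed on the cluster, so this reduces the response size rather than the aggregation cost.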

PS: if you found another solution would you mind sharing it here? Thanks!
