Group by multiple columns in Elasticsearch with a specific result set


Question

I am new to ES and I have a specific requirement. My documents look like this:

{
    "_index" : "bidder_server_stats",
    "_type" : "doc",
    "_id" : "_NTrHGQBv0YTjfMi0Ord",
    "_score" : 1.0,
    "_source" : {
      "avg_price" : 5.8,
      "bc" : "2513",
      "log_dt_st" : "2018-06-08T06:36:16.073Z",
      "nid" : "1",
      "cc" : "880",
      "host" : "ip-172-31-18-62.ec2.internal",
      "country" : "us"
    }
  },
  {
    "_index" : "bidder_server_stats",
    "_type" : "doc",
    "_id" : "_NTrHGQBv0YTjfMi0Ord",
    "_score" : 1.0,
    "_source" : {
      "avg_price" : 10,
      "bc" : "2514",
      "log_dt_st" : "2018-06-08T06:36:16.073Z",
      "nid" : "1",
      "cc" : "880",
      "host" : "ip-172-31-18-62.ec2.internal",
      "country" : "us"
    }
  },
  {
    "_index" : "bidder_server_stats",
    "_type" : "doc",
    "_id" : "_NTrHGQBv0YTjfMi0Ord",
    "_score" : 1.0,
    "_source" : {
      "avg_price" : 11,
      "bc" : "2513",
      "log_dt_st" : "2018-06-08T06:36:16.073Z",
      "nid" : "1",
      "cc" : "880",
      "host" : "ip-172-31-18-62.ec2.internal",
      "country" : "us"
    }
  }

Now I need the result I would get using the query below:

SELECT bc, log_dt_st, SUM(avg_price) FROM table GROUP BY bc, log_dt_st;

How can we do this in Elasticsearch? And I want only these three columns in the result set (i.e. in _source).

Please help.

Answer

You can achieve this with sub-aggregations. Starting from ES 6.1, the composite aggregation can also come in handy (although it is still experimental).

The query might look like this:

POST bidder_server_stats/doc/_search
{
  "size": 0,
  "aggs": {
    "by bc": {
      "terms": {
        "field": "bc"
      },
      "aggs": {
        "by log_dt_st": {
          "terms": {
            "field": "log_dt_st"
          },
          "aggs": {
            "sum(avg_price)": {
              "sum": {
                "field": "avg_price"
              }
            }
          }
        }
      }
    }
  }
}

and the response would look like this:

{
  ...
  "aggregations": {
    "by bc": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "2513",
          "doc_count": 2,
          "by log_dt_st": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": 1528439776073,
                "key_as_string": "2018-06-08T06:36:16.073Z",
                "doc_count": 2,
                "sum(avg_price)": {
                  "value": 16.800000190734863
                }
              }
            ]
          }
        },
        {
          "key": "2514",
          "doc_count": 1,
          "by log_dt_st": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": 1528439776073,
                "key_as_string": "2018-06-08T06:36:16.073Z",
                "doc_count": 1,
                "sum(avg_price)": {
                  "value": 10
                }
              }
            ]
          }
        }
      ]
    }
  }
}
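On the client side, the nested buckets of this response can be flattened into rows equivalent to the SQL result set. A minimal Python sketch (the aggregation names match the query above; the abbreviated sample response is an assumption for illustration):

```python
# Flatten the nested terms-aggregation response into
# (bc, log_dt_st, sum_avg_price) rows, like the SQL result set.
def flatten(aggregations):
    rows = []
    for bc_bucket in aggregations["by bc"]["buckets"]:
        for dt_bucket in bc_bucket["by log_dt_st"]["buckets"]:
            rows.append((
                bc_bucket["key"],                      # bc
                dt_bucket["key_as_string"],            # log_dt_st
                dt_bucket["sum(avg_price)"]["value"],  # sum(avg_price)
            ))
    return rows

# Abbreviated version of the response shown above.
response = {
    "aggregations": {
        "by bc": {
            "buckets": [
                {"key": "2513", "by log_dt_st": {"buckets": [
                    {"key_as_string": "2018-06-08T06:36:16.073Z",
                     "sum(avg_price)": {"value": 16.8}}]}},
                {"key": "2514", "by log_dt_st": {"buckets": [
                    {"key_as_string": "2018-06-08T06:36:16.073Z",
                     "sum(avg_price)": {"value": 10}}]}},
            ]
        }
    }
}

for row in flatten(response["aggregations"]):
    print(row)
```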

A few things to consider:


  • bc should have the keyword type (to be able to run a terms aggregation on it)
  • the terms aggregation only returns the top 10 buckets by default; you may be interested in the size and order options of this aggregation
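For example, the outer terms aggregation could be asked for more buckets and a deterministic order; a fragment like the following could replace the `"terms"` body in the query above (the size value of 100 is illustrative):

```json
"terms": {
  "field": "bc",
  "size": 100,
  "order": { "_key": "asc" }
}
```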

Update: responding to the questions from the comments, since it will improve the answer.

No, not directly. As in a SQL GROUP BY, all fields returned should be either part of the GROUP BY or aggregate functions.

There are a few options to actually get more data alongside the aggregations:


  • the search results themselves (the hits section);
  • the top_hits aggregation, which allows you to retrieve a few of the most relevant documents for a given bucket.

I cannot find any relevant documentation or configuration setting that would give a sure answer. However, there is an index.max_docvalue_fields_search setting that defaults to 100 among the dynamic index settings. Since aggregations use doc_values, I'd say that around 100 bucket aggregations is a reasonable upper limit.

I believe the real limitation here is the actual performance of your Elasticsearch cluster.

It can be done, but it might not be efficient. You may use the script mode of the terms aggregation. The query might look like this:

POST bidder_server_stats/doc/_search
{
  "size": 0,
  "aggs": {
    "via script": {
      "terms": {
        "script": {
          "source": "doc['bc'].value +':::'+ doc['log_dt_st'].value ",
          "lang": "painless"
        }
      },
      "aggs": {
        "sum(avg_price)": {
          "sum": {
            "field": "avg_price"
          }
        }
      }
    }
  }
}

And the result will look like the following:

{
  ...
  "aggregations": {
    "via script": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "2513:::2018-06-08T06:36:16.073Z",
          "doc_count": 2,
          "sum(avg_price)": {
            "value": 16.800000190734863
          }
        },
        {
          "key": "2514:::2018-06-08T06:36:16.073Z",
          "doc_count": 1,
          "sum(avg_price)": {
            "value": 10
          }
        }
      ]
    }
  }
}
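The combined bucket keys can then be split back into separate columns on the client side. A minimal Python sketch (the ":::" separator matches the script in the query above; the abbreviated sample response is an assumption for illustration):

```python
# Split the "bc:::log_dt_st" keys produced by the script-based
# terms aggregation back into separate columns.
def split_rows(aggregations, sep=":::"):
    rows = []
    for bucket in aggregations["via script"]["buckets"]:
        bc, log_dt_st = bucket["key"].split(sep)
        rows.append((bc, log_dt_st, bucket["sum(avg_price)"]["value"]))
    return rows

# Abbreviated version of the response shown above.
response = {
    "aggregations": {
        "via script": {
            "buckets": [
                {"key": "2513:::2018-06-08T06:36:16.073Z",
                 "sum(avg_price)": {"value": 16.8}},
                {"key": "2514:::2018-06-08T06:36:16.073Z",
                 "sum(avg_price)": {"value": 10}},
            ]
        }
    }
}

for row in split_rows(response["aggregations"]):
    print(row)
```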

In order to perform this aggregation, Elasticsearch has to compute the bucket value for each document matching the query, which is the equivalent of a full scan in SQL. Plain field aggregations are instead more like index look-ups, since they use the doc_values data representation, a data structure that makes these look-ups efficient.

In some cases script buckets can be a solution, but their scope is quite limited. If you are interested in a script-based solution, there is also the scripted metric aggregation to consider.

Hope that helps!

The composite aggregation was added in Elasticsearch 6.1. As of 6.3, it is still marked as experimental (so the API might change, or this feature could be removed completely in the future).

The query in this case would look like:

POST bidder_server_stats/doc/_search
{
  "size": 0,
  "aggs": {
    "my composite": {
      "composite": {
        "sources": [
          {
            "bc": {
              "terms": {
                "field": "bc"
              }
            }
          },
          {
            "log_dt_st": {
              "terms": {
                "field": "log_dt_st"
              }
            }
          }
        ]
      },
      "aggs": {
        "sum(avg_price)": {
          "sum": {
            "field": "avg_price"
          }
        }
      }
    }
  }
}

And the response:

{
  "aggregations": {
    "my composite": {
      "after_key": {
        "bc": "2514",
        "log_dt_st": 1528439776073
      },
      "buckets": [
        {
          "key": {
            "bc": "2513",
            "log_dt_st": 1528439776073
          },
          "doc_count": 2,
          "sum(avg_price)": {
            "value": 16.800000190734863
          }
        },
        {
          "key": {
            "bc": "2514",
            "log_dt_st": 1528439776073
          },
          "doc_count": 1,
          "sum(avg_price)": {
            "value": 10
          }
        }
      ]
    }
  }
}
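A nice property of the composite aggregation is pagination: the after_key value from the response can be fed back via the after parameter to fetch the next page of buckets. A follow-up request might look like this (the size value of 10 is illustrative):

```json
POST bidder_server_stats/doc/_search
{
  "size": 0,
  "aggs": {
    "my composite": {
      "composite": {
        "size": 10,
        "after": {
          "bc": "2514",
          "log_dt_st": 1528439776073
        },
        "sources": [
          { "bc": { "terms": { "field": "bc" } } },
          { "log_dt_st": { "terms": { "field": "log_dt_st" } } }
        ]
      },
      "aggs": {
        "sum(avg_price)": {
          "sum": { "field": "avg_price" }
        }
      }
    }
  }
}
```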
