SQL like GROUP BY AND HAVING


Problem description

I want to get the counts of groups which satisfy a certain condition. In SQL terms, I want to do the following in Elasticsearch.

SELECT COUNT(*) FROM
(
   SELECT
    senderResellerId,
    SUM(requestAmountValue) AS t_amount
   FROM
    transactions
   GROUP BY
    senderResellerId
   HAVING
    t_amount > 10000 ) AS dum;

So far, I can group by senderResellerId using a terms aggregation, but when I apply the filter it does not work as expected.

Elastic Request

{
  "aggregations": {
    "reseller_sale_sum": {
      "aggs": {
        "sales": {
          "aggregations": {
            "reseller_sale": {
              "sum": {
                "field": "requestAmountValue"
              }
            }
          }, 
          "filter": {
            "range": {
              "reseller_sale": { 
                "gte": 10000
              }
            }
          }
        }
      }, 
      "terms": {
        "field": "senderResellerId", 
        "order": {
          "sales>reseller_sale": "desc"
        }, 
        "size": 5
      }
    }
  }, 
  "ext": {}, 
  "query": {  "match_all": {} }, 
  "size": 0
}

Actual Response

{
  "took" : 21,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 150824,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "reseller_sale_sum" : {
      "doc_count_error_upper_bound" : -1,
      "sum_other_doc_count" : 149609,
      "buckets" : [
        {
          "key" : "RES0000000004",
          "doc_count" : 8,
          "sales" : {
            "doc_count" : 0,
            "reseller_sale" : {
              "value" : 0.0
            }
          }
        },
        {
          "key" : "RES0000000005",
          "doc_count" : 39,
          "sales" : {
            "doc_count" : 0,
            "reseller_sale" : {
              "value" : 0.0
            }
          }
        },
        {
          "key" : "RES0000000006",
          "doc_count" : 57,
          "sales" : {
            "doc_count" : 0,
            "reseller_sale" : {
              "value" : 0.0
            }
          }
        },
        {
          "key" : "RES0000000007",
          "doc_count" : 134,
          "sales" : {
            "doc_count" : 0,
            "reseller_sale" : {
              "value" : 0.0
            }
          }
        }
      ]
    }
  }
}

As you can see from the above response, it returns resellers, but the reseller_sale aggregation is zero in the results.

More details are here.

Solution

Implementation of HAVING-like behavior

You can use one of the pipeline aggregations, namely the bucket selector aggregation. The query would look like this:

POST my_index/tdrs/_search
{
   "aggregations": {
      "reseller_sale_sum": {
         "aggregations": {
            "sales": {
               "sum": {
                  "field": "requestAmountValue"
               }
            },
            "max_sales": {
               "bucket_selector": {
                  "buckets_path": {
                     "var1": "sales"
                  },
                  "script": "params.var1 > 10000"
               }
            }
         },
         "terms": {
            "field": "senderResellerId",
            "order": {
               "sales": "desc"
            },
            "size": 5
         }
      }
   },
   "size": 0
}

After putting the following documents in the index:

  "hits": [
     {
        "_index": "my_index",
        "_type": "tdrs",
        "_id": "AV9Yh5F-dSw48Z0DWDys",
        "_score": 1,
        "_source": {
           "requestAmountValue": 7000,
           "senderResellerId": "ID_1"
        }
     },
     {
        "_index": "my_index",
        "_type": "tdrs",
        "_id": "AV9Yh684dSw48Z0DWDyt",
        "_score": 1,
        "_source": {
           "requestAmountValue": 5000,
           "senderResellerId": "ID_1"
        }
     },
     {
        "_index": "my_index",
        "_type": "tdrs",
        "_id": "AV9Yh8TBdSw48Z0DWDyu",
        "_score": 1,
        "_source": {
           "requestAmountValue": 1000,
           "senderResellerId": "ID_2"
        }
     }
  ]

The result of the query is:

"aggregations": {
      "reseller_sale_sum": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "ID_1",
               "doc_count": 2,
               "sales": {
                  "value": 12000
               }
            }
         ]
      }
   }

I.e. only those senderResellerId whose cumulative sales are > 10000 are returned.

Counting the buckets

To implement an equivalent of SELECT COUNT(*) FROM (... HAVING), you can combine a bucket script aggregation with a sum bucket aggregation. Although there seems to be no direct way to count how many buckets bucket_selector actually selected, we can define a bucket_script that produces 0 or 1 depending on the condition, and a sum_bucket that sums those values:

POST my_index/tdrs/_search
{
   "aggregations": {
      "reseller_sale_sum": {
         "aggregations": {
            "sales": {
               "sum": {
                  "field": "requestAmountValue"
               }
            },
            "max_sales": {
               "bucket_script": {
                  "buckets_path": {
                     "var1": "sales"
                  },
                  "script": "if (params.var1 > 10000) { 1 } else { 0 }"
               }
            }
         },
         "terms": {
            "field": "senderResellerId",
            "order": {
               "sales": "desc"
            }
         }
      },
      "max_sales_stats": {
         "sum_bucket": {
            "buckets_path": "reseller_sale_sum>max_sales"
         }
      }
   },
   "size": 0
}

The output will be:

   "aggregations": {
      "reseller_sale_sum": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            ...
         ]
      },
      "max_sales_stats": {
         "value": 1
      }
   }

The desired bucket count is located in max_sales_stats.value.

Important considerations

I have to point out 2 things:

  1. The feature is experimental (as of ES 5.6 it is still experimental, even though it was added in 2.0.0-beta1).
  2. Pipeline aggregations are applied to the results of previous aggregations:

Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree.

This means that the bucket_selector aggregation is applied after, and on the result of, the terms aggregation on senderResellerId. For example, if there are more distinct senderResellerId values than the size of the terms aggregation allows, you will not get all the ids with sum(sales) > 10000, but only those that appear in the output of the terms aggregation. Consider using sorting and/or setting a sufficiently large size parameter (see the sketch below).

This also applies to the second case, COUNT(*) (... HAVING), which will only count the buckets that are actually present in the output of the aggregation.
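
For illustration, a minimal sketch of the size consideration: only the terms part of the queries above needs to change (the value 10000 is an arbitrary assumption here; pick it based on the expected number of distinct senderResellerId values):

"terms": {
   "field": "senderResellerId",
   "order": {
      "sales": "desc"
   },
   "size": 10000
}

Keep in mind that very large size values make the terms aggregation itself heavier, which is exactly the trade-off addressed next.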

In case this query is too heavy or the number of buckets is too big, consider denormalizing your data or storing this sum directly in a document, so that you can use a plain range query to achieve your goal.
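
As a rough sketch of that denormalized approach (the reseller_totals index and the total_sales field are illustrative assumptions, not part of the original mapping), the plain range query could look like this:

POST reseller_totals/_search
{
   "size": 0,
   "query": {
      "range": {
         "total_sales": {
            "gt": 10000
         }
      }
   }
}

Assuming one pre-computed summary document per reseller, hits.total then gives the equivalent of the SQL COUNT(*) directly, without any aggregation.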
