Is there a way to use an ORDER BY clause in the COUNT aggregate analytic function? If not, what is a suitable alternative?

Question

I have a table of orders that looks something like this:

WITH my_table_of_orders AS (
  SELECT
    1 AS order_id,
    DATE(2019, 5, 12) AS date,
    5 AS customer_id,
    TRUE AS is_from_particular_store

  UNION ALL SELECT
    2 AS order_id,
    DATE(2019, 5, 11) AS date,
    5 AS customer_id,
    TRUE AS is_from_particular_store

  UNION ALL SELECT
    3 AS order_id,
    DATE(2019, 5, 11) AS date,
    4 AS customer_id,
    FALSE AS is_from_particular_store
)

My actual table contains ~59 million rows.

What I would like to do is return one row per order date, with a second column showing, out of all customers who placed an order in the past year (relative to the current row's date), the percentage who placed an order with a particular store. This is where my fictitious is_from_particular_store column comes in handy.
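
To make the target statistic concrete, here is a minimal Python sketch that computes it by brute force on the three sample rows. The function name `store_share_by_date` and the inclusive 365-day window are assumptions made for illustration. It is meant as a reference for checking query results on a small sample, not as something that would scale to 59 million rows.

```python
from datetime import date, timedelta

# Sample rows mirroring my_table_of_orders:
# (order_id, order_date, customer_id, is_from_particular_store)
ORDERS = [
    (1, date(2019, 5, 12), 5, True),
    (2, date(2019, 5, 11), 5, True),
    (3, date(2019, 5, 11), 4, False),
]


def store_share_by_date(orders, window_days=365):
    """Return {order_date: share}, where share is the fraction of distinct
    customers with an order in the trailing window (inclusive of the date)
    who also ordered from the particular store within that window."""
    result = {}
    for current in sorted({row[1] for row in orders}):
        start = current - timedelta(days=window_days)
        all_customers = set()
        store_customers = set()
        for _, order_date, customer_id, from_store in orders:
            if start <= order_date <= current:
                all_customers.add(customer_id)
                if from_store:
                    store_customers.add(customer_id)
        # Mirror SAFE_DIVIDE: None instead of a division-by-zero error.
        result[current] = (
            len(store_customers) / len(all_customers) if all_customers else None
        )
    return result


for day, share in store_share_by_date(ORDERS).items():
    print(day, share)
```

On the sample data, both dates see customers 4 and 5 in the window and only customer 5 ordered from the particular store, so each date yields 0.5.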

Ideally I could use the following query without running into resource issues. The only problem is that, apparently, you cannot use ORDER BY in an analytic function when DISTINCT is specified. I get the error "Window ORDER BY is not allowed if DISTINCT is specified":

SELECT
  date,
  last_year_customer_id_that_ordered_from_a_particular_store / last_year_customer_id_that_ordered AS number_i_want
FROM (
  SELECT
    date,
    ROW_NUMBER() OVER (
      PARTITION BY
        date
    ) AS row_num,
    COUNT(DISTINCT customer_id) OVER(
      ORDER BY
        UNIX_SECONDS(TIMESTAMP(date))
      -- 31,536,000 = 365 days in seconds
      RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW
    ) AS last_year_customer_id_that_ordered,
    COUNT(DISTINCT IF(is_from_particular_store, customer_id, NULL)) OVER(
      ORDER BY
        UNIX_SECONDS(TIMESTAMP(date))
      -- 31,536,000 = 365 days in seconds
      RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW
    ) AS last_year_customer_id_that_ordered_from_a_particular_store,
  FROM my_table_of_orders
)
WHERE
  -- only return one row per date
  row_num = 1

I then tried using ARRAY_AGG and UNNEST instead:

SELECT
  date,
  SAFE_DIVIDE((SELECT COUNT(DISTINCT customer_id)
    FROM UNNEST(last_year_customer_id_that_ordered_from_a_particular_store) AS customer_id
  ), (SELECT COUNT(DISTINCT customer_id)
    FROM UNNEST(last_year_customer_id_that_ordered) AS customer_id
  )) AS number_i_want_to_calculate
FROM (
  SELECT
    date,
    ROW_NUMBER() OVER (
      PARTITION BY
        date
    ) AS row_num,
    ARRAY_AGG(customer_id) OVER(
      ORDER BY
        UNIX_SECONDS(TIMESTAMP(date))
      -- 31,536,000 = 365 days in seconds
      RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW
    ) AS last_year_customer_id_that_ordered,
    ARRAY_AGG(IF(is_from_particular_store, customer_id, NULL)) OVER(
      ORDER BY
        UNIX_SECONDS(TIMESTAMP(date))
      -- 31,536,000 = 365 days in seconds
      RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW
    ) AS last_year_customer_id_that_ordered_from_a_particular_store,
  FROM my_table_of_orders
)
WHERE
  -- only return one row per date
  row_num = 1

The only problem with this is that I get the following resource issue...

Resources exceeded during query execution: The query could not be executed in the allotted memory.

This question is very similar to https://stackoverflow.com/a/42567839/3902555, which suggests using ARRAY_AGG + UNNEST, but as I said, that causes resource issues for me :(

Does anyone know of a more resource-efficient way to calculate the statistic I am after?

Recommended answer

Another totally refactored version (BigQuery Standard SQL)

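#standardSQL
WITH temp AS (
  -- one row per (date, customer, store flag) to shrink the self-join
  SELECT DISTINCT date, customer_id, is_from_particular_store
  FROM my_table_of_orders
)
SELECT
  a.date,
  SAFE_DIVIDE(
    COUNT(DISTINCT IF(b.is_from_particular_store, b.customer_id, NULL)),
    COUNT(DISTINCT b.customer_id)
  ) AS number_i_want_to_calculate
FROM temp a
CROSS JOIN temp b
-- trailing 365-day window ending on a.date, matching the RANGE window in the question;
-- DATE_DIFF(..., YEAR) counts calendar-year boundaries and would also admit later dates
WHERE DATE_DIFF(a.date, b.date, DAY) BETWEEN 0 AND 365
GROUP BY a.date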
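
The self-join idea can be checked on the question's three sample rows without a BigQuery project. The sketch below is an assumption-laden stand-in, not the answer's exact query: it uses Python's built-in sqlite3 module, so the syntax is SQLite's (CASE instead of IF, julianday() differences instead of DATE_DIFF, plain division instead of SAFE_DIVIDE), the CTE is renamed to the hypothetical `distinct_orders`, and the window is taken to be the trailing 365 days inclusive of the row's date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table_of_orders ("
    "order_id INTEGER, date TEXT, customer_id INTEGER, "
    "is_from_particular_store INTEGER)"
)
conn.executemany(
    "INSERT INTO my_table_of_orders VALUES (?, ?, ?, ?)",
    [
        (1, "2019-05-12", 5, 1),
        (2, "2019-05-11", 5, 1),
        (3, "2019-05-11", 4, 0),
    ],
)

QUERY = """
WITH distinct_orders AS (
  SELECT DISTINCT date, customer_id, is_from_particular_store
  FROM my_table_of_orders
)
SELECT
  a.date,
  -- 1.0 * forces floating-point division in SQLite
  1.0 * COUNT(DISTINCT CASE WHEN b.is_from_particular_store = 1
                            THEN b.customer_id END)
      / COUNT(DISTINCT b.customer_id) AS number_i_want_to_calculate
FROM distinct_orders AS a
JOIN distinct_orders AS b
  -- trailing 365-day window, inclusive of a.date
  ON julianday(a.date) - julianday(b.date) BETWEEN 0 AND 365
GROUP BY a.date
ORDER BY a.date
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Each date is joined only to orders from the preceding 365 days (and the date itself), so later orders never leak into an earlier date's counts. On a large table the join still grows with the number of distinct (date, customer) pairs per window, which is why de-duplicating first matters.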

An alternative to the above is to use approximate aggregation, as in the example below:

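#standardSQL
WITH temp AS (
  -- one row per (date, customer, store flag) to shrink the self-join
  SELECT DISTINCT date, customer_id, is_from_particular_store
  FROM my_table_of_orders
)
SELECT
  a.date,
  SAFE_DIVIDE(
    -- approximate distinct counts trade a small error for lower memory use
    APPROX_COUNT_DISTINCT(IF(b.is_from_particular_store, b.customer_id, NULL)),
    APPROX_COUNT_DISTINCT(b.customer_id)
  ) AS number_i_want_to_calculate
FROM temp a
CROSS JOIN temp b
-- trailing 365-day window ending on a.date, matching the RANGE window in the question
WHERE DATE_DIFF(a.date, b.date, DAY) BETWEEN 0 AND 365
GROUP BY a.date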
