redshift: count distinct customers over window partition


Problem description

Redshift doesn't support DISTINCT aggregates in its window functions. AWS documentation for COUNT states this, and distinct isn't supported for any of the window functions.

I want monthly and YTD unique customer counts for the current year, split by traffic channel as well as totaled across all channels. Since a customer can visit more than once, I need to count only distinct customers, and therefore the Redshift window aggregates won't help.

  • I can count distinct customers using count(distinct customer_id) ... group by, but this gives me only one of the four needed results.
  • I don't want to get into the habit of running a full query for each desired count, piled up between a bunch of union alls. I hope this is not the only solution.

This is what I would write in postgres (or Oracle for that matter):

select order_month
       , traffic_channel
       , count(distinct customer_id) over(partition by order_month, traffic_channel) as customers_by_channel_and_month
       , count(distinct customer_id) over(partition by traffic_channel) as ytd_customers_by_channel
       , count(distinct customer_id) over(partition by order_month) as monthly_customers_all_channels
       , count(distinct customer_id) over() as ytd_total_customers

from orders_traffic_channels
/* otc is a table of dated transactions of customers, channels, and month of order */

where to_char(order_month, 'YYYY') = '2017'

How can I solve this in Redshift?

The result needs to work on a Redshift cluster. Furthermore, this is a simplified problem: the actual desired result also includes product category and customer type, which multiplies the number of partitions needed. Therefore a stack of union all rollups is not a nice solution.

Answer

A blog post from 2016 calls out this problem and provides a rudimentary workaround, so thank you, Mark D. Adams. Strangely, there is very little I could find anywhere on the web, so I'm sharing my (tested) solution.

The key insight is that dense_rank(), ordered by the item in question, gives identical items the same rank, so the highest rank is also the count of unique items. It is a horrible mess if you try to swap in the following for each partition I want:

dense_rank() over(partition by order_month, traffic_channel order by customer_id)

Since you need the highest rank, you have to subquery everything and select the max value from each ranking taken. It's important to match the partitions in the outer query to the corresponding partitions in the subquery.

/* multigrain windowed distinct count, additional grains are one dense_rank and one max over() */
select distinct
       order_month
       , traffic_channel
       , max(tc_mth_rnk) over(partition by order_month, traffic_channel) customers_by_channel_and_month
       , max(tc_rnk) over(partition by traffic_channel)  ytd_customers_by_channel
       , max(mth_rnk) over(partition by order_month)  monthly_customers_all_channels
       , max(cust_rnk) over()  ytd_total_customers

from (
       select order_month
              , traffic_channel
              , dense_rank() over(partition by order_month, traffic_channel order by customer_id)  tc_mth_rnk
              , dense_rank() over(partition by traffic_channel order by customer_id)  tc_rnk
              , dense_rank() over(partition by order_month order by customer_id)  mth_rnk
              , dense_rank() over(order by customer_id)  cust_rnk

       from orders_traffic_channels

       where to_char(order_month, 'YYYY') = '2017'
     ) as ranked

order by order_month, traffic_channel
;
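The equivalence the query relies on can be checked outside Redshift: SQLite (3.25+) supports dense_rank(), so here is a minimal sketch of one grain of the pattern, using a made-up toy dataset with the same table and column names as the question. This only illustrates the technique; it is not the author's production data.

```python
import sqlite3

# Toy stand-in for orders_traffic_channels: (order_month, traffic_channel, customer_id).
# The rows are invented for illustration.
rows = [
    ("2017-01", "search", 1),
    ("2017-01", "search", 1),   # repeat visit by the same customer
    ("2017-01", "search", 2),
    ("2017-01", "email",  2),
    ("2017-02", "search", 3),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table orders_traffic_channels "
    "(order_month text, traffic_channel text, customer_id integer)"
)
conn.executemany("insert into orders_traffic_channels values (?, ?, ?)", rows)

# max(dense_rank) per partition reproduces count(distinct customer_id) per partition.
sql = """
select distinct
       order_month,
       traffic_channel,
       max(tc_mth_rnk) over (partition by order_month, traffic_channel)
           as customers_by_channel_and_month
from (
       select order_month,
              traffic_channel,
              dense_rank() over (partition by order_month, traffic_channel
                                 order by customer_id) as tc_mth_rnk
       from orders_traffic_channels
     ) as ranked
order by order_month, traffic_channel
"""
result = conn.execute(sql).fetchall()
print(result)
# [('2017-01', 'email', 1), ('2017-01', 'search', 2), ('2017-02', 'search', 1)]
```

The select distinct in the outer query matters: the max() window function still emits one row per input row, and distinct collapses them back to one row per partition.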

Notes

  • The partitions of max() and dense_rank() must match.
  • dense_rank() will rank null values (all at the same rank, the max). If you want to not count null values you need a case when customer_id is not null then dense_rank() ...etc..., or you can subtract one from the max() if you know there are nulls.
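The subtraction variant from the second note can be sketched as follows, again in SQLite via Python for testability. The table name visits and the 0/1 null_flag are hypothetical; the flag generalizes "subtract one from the max()" so it also stays correct when no nulls are present, regardless of whether the engine sorts nulls first or last.

```python
import sqlite3

# Hypothetical visits table where one row has a NULL customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("create table visits (customer_id integer)")
conn.executemany("insert into visits values (?)", [(None,), (1,), (1,), (2,)])

# dense_rank() gives the NULLs a rank of their own, so max(rnk) over-counts
# by one whenever NULLs exist; subtracting max(null_flag) (1 if any NULL,
# else 0) restores count(distinct customer_id), which ignores NULLs.
sql = """
select max(rnk) - max(null_flag) as distinct_non_null_customers
from (
       select dense_rank() over (order by customer_id) as rnk,
              case when customer_id is null then 1 else 0 end as null_flag
       from visits
     ) as ranked
"""
distinct_non_null = conn.execute(sql).fetchone()[0]
print(distinct_non_null)  # 2
```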