Average on most recent date for which data is available for multiple columns

Problem description

I have a table (on BigQuery) that looks like the following:

|    Date    |    Type    |    Score1   |    Score2   |
|------------|------------|-------------|-------------|
| 2021-01-04 |      A     |      5      |     NULL    |
| 2021-01-04 |      A     |      4      |     NULL    |
| 2021-01-04 |      A     |      5      |     NULL    |
| 2021-01-02 |      A     |      1      |     NULL    |
| 2021-01-02 |      A     |      1      |     NULL    |
| 2021-01-01 |      A     |      3      |      2      |
| 2021-01-04 |      B     |     NULL    |      3      |
| 2021-01-04 |      B     |     NULL    |     NULL    |
| 2021-01-02 |      B     |     NULL    |      4      |
| 2021-01-02 |      B     |     NULL    |      4      |
| 2021-01-01 |      B     |      2      |      5      |
| 2021-01-01 |      B     |      5      |      3      |
| 2021-01-04 |      C     |     NULL    |     NULL    |
| 2021-01-04 |      C     |      4      |     NULL    |
| 2021-01-04 |      C     |     NULL    |     NULL    |
| 2021-01-01 |      C     |      1      |      5      |
| 2021-01-01 |      C     |      2      |      4      |
| 2021-01-01 |      C     |      3      |      4      |

What I would like to get is the average score for each type but the average should be taken only on the most recent date for which at least one score is available for the type. From the example above, the aim is to obtain the following table in one query (that can contain subqueries):

|    Type    |  AVG Score1 |  AVG Score2 |
|------------|-------------|-------------|
|      A     |  (5+4+5)/3  |    (2)/1    |
|      B     |   (2+5)/2   |    (3)/1    |
|      C     |    (4)/1    |  (5+4+4)/3  |

I need a solution that can be adapted if I want the average score not for each type but for each combination of two columns (type/color), still on the most recent date for which at least one score is available for that combination. It should also be possible to adapt it to more score columns and to different aggregations (AVG/MAX/MIN...).

N.B.: A first question was asked to handle the same problem with only one score column: Average on the most recent date for which data is available.

Recommended answer

You can follow the same approach, but you need to rank each score column separately. Because the most recent date with data can differ from one column to the next, a single filter on the rows doesn't work; each aggregate needs its own condition. That would be:

select type,
       avg(case when seqnum_1 = 1 then score1 end) as avg_1,
       avg(case when seqnum_2 = 1 then score2 end) as avg_2
from (select t.*,
             -- rank dates separately for NULL and non-NULL scores, newest date first
             dense_rank() over (partition by type, score1 is null order by date desc) as seqnum_1,
             dense_rank() over (partition by type, score2 is null order by date desc) as seqnum_2
      from t
     ) t
group by type;
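To check this ranking approach against the sample table from the question, here is a minimal sketch that loads the rows into an in-memory SQLite database and runs the query with `avg_2` computed from `score2`. SQLite stands in for BigQuery here purely as an assumption for local testing; the function name `latest_available_averages` is made up for this example, and an `order by type` is added only to make the output order predictable.

```python
import sqlite3

# Sample rows copied from the question: (date, type, score1, score2).
ROWS = [
    ("2021-01-04", "A", 5, None),
    ("2021-01-04", "A", 4, None),
    ("2021-01-04", "A", 5, None),
    ("2021-01-02", "A", 1, None),
    ("2021-01-02", "A", 1, None),
    ("2021-01-01", "A", 3, 2),
    ("2021-01-04", "B", None, 3),
    ("2021-01-04", "B", None, None),
    ("2021-01-02", "B", None, 4),
    ("2021-01-02", "B", None, 4),
    ("2021-01-01", "B", 2, 5),
    ("2021-01-01", "B", 5, 3),
    ("2021-01-04", "C", None, None),
    ("2021-01-04", "C", 4, None),
    ("2021-01-04", "C", None, None),
    ("2021-01-01", "C", 1, 5),
    ("2021-01-01", "C", 2, 4),
    ("2021-01-01", "C", 3, 4),
]

# Same shape as the answer's query; "order by type" is an addition for stable output.
QUERY = """
select type,
       avg(case when seqnum_1 = 1 then score1 end) as avg_1,
       avg(case when seqnum_2 = 1 then score2 end) as avg_2
from (select t.*,
             dense_rank() over (partition by type, score1 is null order by date desc) as seqnum_1,
             dense_rank() over (partition by type, score2 is null order by date desc) as seqnum_2
      from t
     ) t
group by type
order by type
"""


def latest_available_averages(rows):
    """Run the query on an in-memory SQLite table (a stand-in for BigQuery)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (date text, type text, score1 integer, score2 integer)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?)", rows)
    result = conn.fetchall() if False else conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latest_available_averages(ROWS):
        print(row)
```

The three result rows correspond to the expected table in the question: type A averages `score1` over 2021-01-04 and `score2` over 2021-01-01, type B does the reverse, and type C follows the same pattern as A.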

Note: Rows whose score is NULL are ranked in their own partition, so some of them also get a sequence number of 1 and are passed to the average. There is no harm in that, because AVG ignores NULL values, so they don't affect the results. You could also express this as:

select type,
       avg(case when seqnum_1 = 1 then score1 end) as avg_1,
       avg(case when seqnum_2 = 1 then score2 end) as avg_2
from (select t.*,
             -- non-NULL scores sort first, then newest date, so rank 1 is the latest date with a value
             dense_rank() over (partition by type order by score1 is not null desc, date desc) as seqnum_1,
             dense_rank() over (partition by type order by score2 is not null desc, date desc) as seqnum_2
      from t
     ) t
group by type;
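The question also asks how to adapt this to a combination of columns (type/color), to more score columns, and to other aggregations. One possible way, sketched below, is to generate the query text: every grouping column goes into both `partition by` and `group by`, and each score column gets its own `dense_rank` and its own conditional aggregate. Everything here is an assumption for illustration: the helper names `build_query` and `run`, the `color` column values, and the made-up sample rows are not from the original post, and SQLite is again used only as a local stand-in for BigQuery.

```python
import sqlite3

ALLOWED_AGGS = {"avg", "min", "max", "sum"}


def build_query(group_cols, score_cols, agg="avg", table="t"):
    """Hypothetical helper: build the 'latest date with data' query text.

    Column names are interpolated directly, so they must come from trusted code.
    """
    if agg not in ALLOWED_AGGS:
        raise ValueError(f"unsupported aggregate: {agg}")
    keys = ", ".join(group_cols)
    # One dense_rank per score column; NULL and non-NULL rows are ranked separately.
    ranks = ",\n             ".join(
        f"dense_rank() over (partition by {keys}, {col} is null "
        f"order by date desc) as seqnum_{col}"
        for col in score_cols
    )
    # One conditional aggregate per score column.
    aggregates = ",\n       ".join(
        f"{agg}(case when seqnum_{col} = 1 then {col} end) as {agg}_{col}"
        for col in score_cols
    )
    return (
        f"select {keys},\n       {aggregates}\n"
        f"from (select src.*,\n             {ranks}\n"
        f"      from {table} as src\n     ) as ranked\n"
        f"group by {keys}\n"
        f"order by {keys}"
    )


# Made-up rows for illustration: (date, type, color, score1, score2).
SAMPLE = [
    ("2021-01-04", "A", "red", 5, None),
    ("2021-01-04", "A", "red", 3, None),
    ("2021-01-02", "A", "red", 1, 2),
    ("2021-01-04", "A", "blue", None, None),
    ("2021-01-01", "A", "blue", 4, 6),
    ("2021-01-01", "A", "blue", 2, 8),
]


def run(query, rows=SAMPLE):
    """Execute the generated query against an in-memory SQLite table named t."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (date text, type text, color text, "
        "score1 integer, score2 integer)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?, ?)", rows)
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    query = build_query(["type", "color"], ["score1", "score2"], agg="max")
    print(query)
    print(run(query))
```

Generating the SQL keeps the pattern in one place, but writing the extra `dense_rank` and aggregate lines by hand works just as well when there are only a few score columns.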
