具有缺失值的BIGQUERY移动平均线 [英] BIGQUERY moving average with missing values

查看:84
本文介绍了具有缺失值的BIGQUERY移动平均线的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下数据

with dummy_data as 
(
SELECT '2017-01-01' as ref_month, 18 as value, 1 as id
UNION ALL SELECT '2017-02-01' as ref_month, 20 as value, 1 as id
UNION ALL SELECT '2017-03-01' as ref_month, 22 as value, 1 as id
-- UNION ALL SELECT '2017-04-01' as ref_month, 28 as value, 1 as id
UNION ALL SELECT '2017-05-01' as ref_month, 30 as value, 1 as id
UNION ALL SELECT '2017-06-01' as ref_month, 37 as value, 1 as id
UNION ALL SELECT '2017-07-01' as ref_month, 42 as value, 1 as id
-- UNION ALL SELECT '2017-08-01' as ref_month, 55 as value, 1 as id
-- UNION ALL SELECT '2017-09-01' as ref_month, 49 as value, 1 as id
UNION ALL SELECT '2017-10-01' as ref_month, 51 as value, 1 as id
UNION ALL SELECT '2017-11-01' as ref_month, 57 as value, 1 as id
UNION ALL SELECT '2017-12-01' as ref_month, 56 as value, 1 as id
UNION ALL SELECT '2017-01-01' as ref_month, 18 as value, 2 as id
UNION ALL SELECT '2017-02-01' as ref_month, 20 as value, 2 as id
UNION ALL SELECT '2017-03-01' as ref_month, 22 as value, 2 as id
UNION ALL SELECT '2017-04-01' as ref_month, 28 as value, 2 as id
-- UNION ALL SELECT '2017-05-01' as ref_month, 30 as value, 2 as id
-- UNION ALL SELECT '2017-06-01' as ref_month, 37 as value, 2 as id
UNION ALL SELECT '2017-07-01' as ref_month, 42 as value, 2 as id
UNION ALL SELECT '2017-08-01' as ref_month, 55 as value, 2 as id
UNION ALL SELECT '2017-09-01' as ref_month, 49 as value, 2 as id
-- UNION ALL SELECT '2017-10-01' as ref_month, 51 as value, 2 as id
UNION ALL SELECT '2017-11-01' as ref_month, 57 as value, 2 as id
UNION ALL SELECT '2017-12-01' as ref_month, 56 as value, 2 as id
)

我想计算每个id的移动平均值.我知道您可以执行以下操作

And i want to calculate the moving average for each id. I know you can do something like the following

select 
    id
  , ref_month
  , avg(value) over (partition by id order by ref_month ROWS BETWEEN 5 PRECEDING AND CURRENT ROW ) as moving_avg
from 
    dummy_data

但是,正如您从我的虚拟数据中看到的那样,存在一些缺失值. 关于缺少一些值时如何轻松计算移动平均的任何想法? 我当时想先计算一个完整的日期范围

but as you can see from my dummy data, there are some missing values. Any ideas on how to calculate the moving average easily when there are some missing values? I was thinking to calculate first a full date range

date_range AS
(
  SELECT reference_month
  FROM UNNEST(
      GENERATE_DATE_ARRAY(PARSE_DATE('%Y-%m-%d', (SELECT MIN(ref_month) FROM dummy_data)), PARSE_DATE('%Y-%m-%d', (SELECT MAX(ref_month) FROM dummy_data)), INTERVAL 1 MONTH)
  ) AS reference_month
)

,然后使用id进行笛卡尔乘积,然后与我的虚拟数据重新加入,但这似乎是一种反模式.关于如何最佳地执行此操作的任何想法? 谢谢

and then do a cartesian products with the ids and then join back with my dummy data, but this seems an anti pattern. Any idea on how to do this optimally? Thanks

预期结果: 对于ID 1:

expected result: For id 1:

2017-01-01  18
2017-02-01  19
2017-03-01  20
2017-05-01  18
2017-06-01  21.8
2017-07-01  26.2
2017-10-01  26
2017-11-01  30
2017-12-01  32.8

对于ID 2:

2017-01-01  18
2017-02-01  19
2017-03-01  20
2017-04-01  22
2017-07-01  18.4
2017-08-01  25
2017-09-01  29.2
2017-11-01  40.6
2017-12-01  43.4

推荐答案

以下内容适用于BigQuery Standard SQL,并且可以正常使用! :o)
假设您的ref_month是DATE数据类型(如果您的情况是STRING-仍然可以-请参见我答案底部的注释)

Below is for BigQuery Standard SQL and actually works! :o)
It assumes that your ref_month is of DATE data type (if in your case you have it as STRING - still Okay - see note at the very bottom of my answer)

#standardSQL
SELECT 
  id, 
  ref_month,
  ROUND(SUM(value) OVER (rolling_six_days) / 
    (LAST_VALUE(month_pos) OVER (rolling_six_days) 
      - FIRST_VALUE(month_pos) OVER (rolling_six_days)
      + 1)
  ) AS correct_moving_avg
FROM (
  SELECT id, ref_month, value,
    DATE_DIFF(ref_month, '2016-01-01', MONTH) month_pos
  FROM dummy_data
)
WINDOW rolling_six_days AS 
  (PARTITION BY id ORDER BY month_pos RANGE BETWEEN 5 PRECEDING AND CURRENT ROW )

您可以使用下面的示例数据来测试/玩游戏

You can test / play with it using your example data as below

#standardSQL
WITH dummy_data AS (
  SELECT DATE '2017-01-01' AS ref_month, 18 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-02-01' AS ref_month, 20 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-03-01' AS ref_month, 22 AS value, 1 AS id
  -- UNION ALL SELECT DATE '2017-04-01' as ref_month, 28 as value, 1 as id
  UNION ALL SELECT DATE '2017-05-01' AS ref_month, 30 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-06-01' AS ref_month, 37 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-07-01' AS ref_month, 42 AS value, 1 AS id
  -- UNION ALL SELECT DATE '2017-08-01' as ref_month, 55 as value, 1 as id
  -- UNION ALL SELECT DATE '2017-09-01' as ref_month, 49 as value, 1 as id
  UNION ALL SELECT DATE '2017-10-01' AS ref_month, 51 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-11-01' AS ref_month, 57 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-12-01' AS ref_month, 56 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-01-01' AS ref_month, 18 AS value, 2 AS id
  UNION ALL SELECT DATE '2017-02-01' AS ref_month, 20 AS value, 2 AS id
  UNION ALL SELECT DATE '2017-03-01' AS ref_month, 22 AS value, 2 AS id
  UNION ALL SELECT DATE '2017-04-01' AS ref_month, 28 AS value, 2 AS id
  -- UNION ALL SELECT DATE '2017-05-01' as ref_month, 30 as value, 2 as id
  -- UNION ALL SELECT DATE '2017-06-01' as ref_month, 37 as value, 2 as id
  UNION ALL SELECT DATE '2017-07-01' AS ref_month, 42 AS value, 2 AS id
  UNION ALL SELECT DATE '2017-08-01' AS ref_month, 55 AS value, 2 AS id
  UNION ALL SELECT DATE '2017-09-01' AS ref_month, 49 AS value, 2 AS id
  -- UNION ALL SELECT DATE '2017-10-01' as ref_month, 51 as value, 2 as id
  UNION ALL SELECT DATE '2017-11-01' AS ref_month, 57 AS value, 2 AS id
  UNION ALL SELECT DATE '2017-12-01' AS ref_month, 56 AS value, 2 AS id
)
SELECT 
  id, 
  ref_month,
  ROUND(SUM(value) OVER (rolling_six_days) / 
    (LAST_VALUE(month_pos) OVER (rolling_six_days) 
      - FIRST_VALUE(month_pos) OVER (rolling_six_days)
      + 1)
  ) AS correct_moving_avg
FROM (
  SELECT id, ref_month, value,
    DATE_DIFF(ref_month, '2016-01-01', MONTH) month_pos
  FROM dummy_data
)
WINDOW rolling_six_days AS (PARTITION BY id ORDER BY month_pos RANGE BETWEEN 5 PRECEDING AND CURRENT ROW )
ORDER BY 1,2  

为帮助您探索逻辑-请参见下面上述查询的扩展"版本-它甚至将所有中间值都传播到非常外部的选择范围内,因此您可以看到所有内容...

To help you in exploring logic - see below "expanded" version of above query - it has all even intermediate values propagated up to very outside select so you can see everything ...

#standardSQL
WITH dummy_data AS 
(
  SELECT DATE '2017-01-01' AS ref_month, 18 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-02-01' AS ref_month, 20 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-03-01' AS ref_month, 22 AS value, 1 AS id
  -- UNION ALL SELECT DATE '2017-04-01' as ref_month, 28 as value, 1 as id
  UNION ALL SELECT DATE '2017-05-01' AS ref_month, 30 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-06-01' AS ref_month, 37 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-07-01' AS ref_month, 42 AS value, 1 AS id
  -- UNION ALL SELECT DATE '2017-08-01' as ref_month, 55 as value, 1 as id
  -- UNION ALL SELECT DATE '2017-09-01' as ref_month, 49 as value, 1 as id
  UNION ALL SELECT DATE '2017-10-01' AS ref_month, 51 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-11-01' AS ref_month, 57 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-12-01' AS ref_month, 56 AS value, 1 AS id
  UNION ALL SELECT DATE '2017-01-01' AS ref_month, 18 AS value, 2 AS id
  UNION ALL SELECT DATE '2017-02-01' AS ref_month, 20 AS value, 2 AS id
  UNION ALL SELECT DATE '2017-03-01' AS ref_month, 22 AS value, 2 AS id
  UNION ALL SELECT DATE '2017-04-01' AS ref_month, 28 AS value, 2 AS id
  -- UNION ALL SELECT DATE '2017-05-01' as ref_month, 30 as value, 2 as id
  -- UNION ALL SELECT DATE '2017-06-01' as ref_month, 37 as value, 2 as id
  UNION ALL SELECT DATE '2017-07-01' AS ref_month, 42 AS value, 2 AS id
  UNION ALL SELECT DATE '2017-08-01' AS ref_month, 55 AS value, 2 AS id
  UNION ALL SELECT DATE '2017-09-01' AS ref_month, 49 AS value, 2 AS id
  -- UNION ALL SELECT DATE '2017-10-01' as ref_month, 51 as value, 2 as id
  UNION ALL SELECT DATE '2017-11-01' AS ref_month, 57 AS value, 2 AS id
  UNION ALL SELECT DATE '2017-12-01' AS ref_month, 56 AS value, 2 AS id
)
SELECT 
  id,
  ref_month,
  value,
  moving_sum,
  first_month,
  last_month,
  ROUND(moving_sum / (last_month - first_month + 1)) AS correct_moving_avg,
  moving_avg
FROM (
  SELECT
    id,
    ref_month,
    value,
    SUM(value) OVER (rolling_six_days) AS moving_sum,
    FIRST_VALUE(month_pos) OVER (rolling_six_days) AS first_month,
    LAST_VALUE(month_pos) OVER (rolling_six_days) AS last_month,
    AVG(value) OVER (rolling_six_days) AS moving_avg
  FROM (
    SELECT 
      id, ref_month, value,
      DATE_DIFF(ref_month, '2016-01-01', MONTH) month_pos
    FROM dummy_data
  )
  WINDOW rolling_six_days AS 
    (PARTITION BY id ORDER BY month_pos RANGE BETWEEN 5 PRECEDING AND CURRENT ROW )
)
ORDER BY 1,2   

结果为

id  ref_month   value moving_sum    first_month last_month  correct_moving_avg  moving_avg   
1    2017-01-01 18    18            12          12          18.0                  18.0   
1    2017-02-01 20    38            12          13          19.0                  19.0   
1    2017-03-01 22    60            12          14          20.0                  20.0   
1    2017-05-01 30    90            12          16          18.0                  22.5   
1    2017-06-01 37    127           12          17          21.0                  25.4   
1    2017-07-01 42    151           13          18          25.0                  30.2   
1    2017-10-01 51    160           16          21          27.0                  40.0   
1    2017-11-01 57    187           17          22          31.0                  46.75  
1    2017-12-01 56    206           18          23          34.0                  51.5   
2    2017-01-01 18    18            12          12          18.0                  18.0   
2    2017-02-01 20    38            12          13          19.0                  19.0   
2    2017-03-01 22    60            12          14          20.0                  20.0   
2    2017-04-01 28    88            12          15          22.0                  22.0   
2    2017-07-01 42    112           13          18          19.0                  28.0   
2    2017-08-01 55    147           14          19          25.0                  36.75  
2    2017-09-01 49    174           15          20          29.0                  43.5   
2    2017-11-01 57    203           18          22          41.0                  50.75  
2    2017-12-01 56    259           18          23          43.0                  51.8     

希望这向您显示/说明了方法

hope this shows/explains you the approach

注意:如果您的ref_month字段为STRING`数据类型,则应使用DATE_DIFF稍微调整行-应该为

Note: if your ref_month field is of STRING` data tyoe you should slightly adjust line with DATE_DIFF - it should be as

DATE_DIFF(cast(ref_month as DATE), '2016-01-01', MONTH) month_pos

注2:我选择"2016-01-01"作为计算月份的起点-但您可以选择任意一个以确保它小于您的最小日期-例如"2000-01-01"将也很完美

Note 2: I picked '2016-01-01' as a starting point for counting months - but you can pick any to make sure that it is less than your minimum date - for example '2000-01-01' will perfectly work too

这篇关于具有缺失值的BIGQUERY移动平均线的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆