具有缺失值的BIGQUERY移动平均线 [英] BIGQUERY moving average with missing values
问题描述
我有以下数据
with dummy_data as
(
SELECT '2017-01-01' as ref_month, 18 as value, 1 as id
UNION ALL SELECT '2017-02-01' as ref_month, 20 as value, 1 as id
UNION ALL SELECT '2017-03-01' as ref_month, 22 as value, 1 as id
-- UNION ALL SELECT '2017-04-01' as ref_month, 28 as value, 1 as id
UNION ALL SELECT '2017-05-01' as ref_month, 30 as value, 1 as id
UNION ALL SELECT '2017-06-01' as ref_month, 37 as value, 1 as id
UNION ALL SELECT '2017-07-01' as ref_month, 42 as value, 1 as id
-- UNION ALL SELECT '2017-08-01' as ref_month, 55 as value, 1 as id
-- UNION ALL SELECT '2017-09-01' as ref_month, 49 as value, 1 as id
UNION ALL SELECT '2017-10-01' as ref_month, 51 as value, 1 as id
UNION ALL SELECT '2017-11-01' as ref_month, 57 as value, 1 as id
UNION ALL SELECT '2017-12-01' as ref_month, 56 as value, 1 as id
UNION ALL SELECT '2017-01-01' as ref_month, 18 as value, 2 as id
UNION ALL SELECT '2017-02-01' as ref_month, 20 as value, 2 as id
UNION ALL SELECT '2017-03-01' as ref_month, 22 as value, 2 as id
UNION ALL SELECT '2017-04-01' as ref_month, 28 as value, 2 as id
-- UNION ALL SELECT '2017-05-01' as ref_month, 30 as value, 2 as id
-- UNION ALL SELECT '2017-06-01' as ref_month, 37 as value, 2 as id
UNION ALL SELECT '2017-07-01' as ref_month, 42 as value, 2 as id
UNION ALL SELECT '2017-08-01' as ref_month, 55 as value, 2 as id
UNION ALL SELECT '2017-09-01' as ref_month, 49 as value, 2 as id
-- UNION ALL SELECT '2017-10-01' as ref_month, 51 as value, 2 as id
UNION ALL SELECT '2017-11-01' as ref_month, 57 as value, 2 as id
UNION ALL SELECT '2017-12-01' as ref_month, 56 as value, 2 as id
)
我想计算每个id的移动平均值.我知道您可以执行以下操作
And i want to calculate the moving average for each id. I know you can do something like the following
select
id
, ref_month
, avg(value) over (partition by id order by ref_month ROWS BETWEEN 5 PRECEDING AND CURRENT ROW ) as moving_avg
from
dummy_data
但是,正如您从我的虚拟数据中看到的那样,存在一些缺失值. 关于缺少一些值时如何轻松计算移动平均的任何想法? 我当时想先计算一个完整的日期范围
but as you can see from my dummy data, there are some missing values. Any ideas on how to calculate the moving average easily when there are some missing values? I was thinking to calculate first a full date range
date_range AS
(
SELECT reference_month
FROM UNNEST(
GENERATE_DATE_ARRAY(PARSE_DATE('%Y-%m-%d', (SELECT MIN(ref_month) FROM dummy_data)), PARSE_DATE('%Y-%m-%d', (SELECT MAX(ref_month) FROM dummy_data)), INTERVAL 1 MONTH)
) AS reference_month
)
,然后使用id进行笛卡尔乘积,然后与我的虚拟数据重新加入,但这似乎是一种反模式.关于如何最佳地执行此操作的任何想法? 谢谢
and then do a cartesian products with the ids and then join back with my dummy data, but this seems an anti pattern. Any idea on how to do this optimally? Thanks
预期结果: 对于ID 1:
expected result: For id 1:
2017-01-01 18
2017-02-01 19
2017-03-01 20
2017-05-01 18
2017-06-01 21.8
2017-07-01 26.2
2017-10-01 26
2017-11-01 30
2017-12-01 32.8
对于ID 2:
2017-01-01 18
2017-02-01 19
2017-03-01 20
2017-04-01 22
2017-07-01 18.4
2017-08-01 25
2017-09-01 29.2
2017-11-01 40.6
2017-12-01 43.4
推荐答案
以下内容适用于BigQuery Standard SQL,并且可以正常使用! :o)
假设您的ref_month是DATE
数据类型(如果您的情况是STRING
-仍然可以-请参见我答案底部的注释)
Below is for BigQuery Standard SQL and actually works! :o)
It assumes that your ref_month is of DATE
data type (if in your case you have it as STRING
- still Okay - see note at the very bottom of my answer)
#standardSQL
SELECT
id,
ref_month,
ROUND(SUM(value) OVER (rolling_six_days) /
(LAST_VALUE(month_pos) OVER (rolling_six_days)
- FIRST_VALUE(month_pos) OVER (rolling_six_days)
+ 1)
) AS correct_moving_avg
FROM (
SELECT id, ref_month, value,
DATE_DIFF(ref_month, '2016-01-01', MONTH) month_pos
FROM dummy_data
)
WINDOW rolling_six_days AS
(PARTITION BY id ORDER BY month_pos RANGE BETWEEN 5 PRECEDING AND CURRENT ROW )
您可以使用下面的示例数据来测试/玩游戏
You can test / play with it using your example data as below
#standardSQL
WITH dummy_data AS (
SELECT DATE '2017-01-01' AS ref_month, 18 AS value, 1 AS id
UNION ALL SELECT DATE '2017-02-01' AS ref_month, 20 AS value, 1 AS id
UNION ALL SELECT DATE '2017-03-01' AS ref_month, 22 AS value, 1 AS id
-- UNION ALL SELECT DATE '2017-04-01' as ref_month, 28 as value, 1 as id
UNION ALL SELECT DATE '2017-05-01' AS ref_month, 30 AS value, 1 AS id
UNION ALL SELECT DATE '2017-06-01' AS ref_month, 37 AS value, 1 AS id
UNION ALL SELECT DATE '2017-07-01' AS ref_month, 42 AS value, 1 AS id
-- UNION ALL SELECT DATE '2017-08-01' as ref_month, 55 as value, 1 as id
-- UNION ALL SELECT DATE '2017-09-01' as ref_month, 49 as value, 1 as id
UNION ALL SELECT DATE '2017-10-01' AS ref_month, 51 AS value, 1 AS id
UNION ALL SELECT DATE '2017-11-01' AS ref_month, 57 AS value, 1 AS id
UNION ALL SELECT DATE '2017-12-01' AS ref_month, 56 AS value, 1 AS id
UNION ALL SELECT DATE '2017-01-01' AS ref_month, 18 AS value, 2 AS id
UNION ALL SELECT DATE '2017-02-01' AS ref_month, 20 AS value, 2 AS id
UNION ALL SELECT DATE '2017-03-01' AS ref_month, 22 AS value, 2 AS id
UNION ALL SELECT DATE '2017-04-01' AS ref_month, 28 AS value, 2 AS id
-- UNION ALL SELECT DATE '2017-05-01' as ref_month, 30 as value, 2 as id
-- UNION ALL SELECT DATE '2017-06-01' as ref_month, 37 as value, 2 as id
UNION ALL SELECT DATE '2017-07-01' AS ref_month, 42 AS value, 2 AS id
UNION ALL SELECT DATE '2017-08-01' AS ref_month, 55 AS value, 2 AS id
UNION ALL SELECT DATE '2017-09-01' AS ref_month, 49 AS value, 2 AS id
-- UNION ALL SELECT DATE '2017-10-01' as ref_month, 51 as value, 2 as id
UNION ALL SELECT DATE '2017-11-01' AS ref_month, 57 AS value, 2 AS id
UNION ALL SELECT DATE '2017-12-01' AS ref_month, 56 AS value, 2 AS id
)
SELECT
id,
ref_month,
ROUND(SUM(value) OVER (rolling_six_days) /
(LAST_VALUE(month_pos) OVER (rolling_six_days)
- FIRST_VALUE(month_pos) OVER (rolling_six_days)
+ 1)
) AS correct_moving_avg
FROM (
SELECT id, ref_month, value,
DATE_DIFF(ref_month, '2016-01-01', MONTH) month_pos
FROM dummy_data
)
WINDOW rolling_six_days AS (PARTITION BY id ORDER BY month_pos RANGE BETWEEN 5 PRECEDING AND CURRENT ROW )
ORDER BY 1,2
为帮助您探索逻辑-请参见下面上述查询的扩展"版本-它甚至将所有中间值都传播到非常外部的选择范围内,因此您可以看到所有内容...
To help you in exploring logic - see below "expanded" version of above query - it has all even intermediate values propagated up to very outside select so you can see everything ...
#standardSQL
WITH dummy_data AS
(
SELECT DATE '2017-01-01' AS ref_month, 18 AS value, 1 AS id
UNION ALL SELECT DATE '2017-02-01' AS ref_month, 20 AS value, 1 AS id
UNION ALL SELECT DATE '2017-03-01' AS ref_month, 22 AS value, 1 AS id
-- UNION ALL SELECT DATE '2017-04-01' as ref_month, 28 as value, 1 as id
UNION ALL SELECT DATE '2017-05-01' AS ref_month, 30 AS value, 1 AS id
UNION ALL SELECT DATE '2017-06-01' AS ref_month, 37 AS value, 1 AS id
UNION ALL SELECT DATE '2017-07-01' AS ref_month, 42 AS value, 1 AS id
-- UNION ALL SELECT DATE '2017-08-01' as ref_month, 55 as value, 1 as id
-- UNION ALL SELECT DATE '2017-09-01' as ref_month, 49 as value, 1 as id
UNION ALL SELECT DATE '2017-10-01' AS ref_month, 51 AS value, 1 AS id
UNION ALL SELECT DATE '2017-11-01' AS ref_month, 57 AS value, 1 AS id
UNION ALL SELECT DATE '2017-12-01' AS ref_month, 56 AS value, 1 AS id
UNION ALL SELECT DATE '2017-01-01' AS ref_month, 18 AS value, 2 AS id
UNION ALL SELECT DATE '2017-02-01' AS ref_month, 20 AS value, 2 AS id
UNION ALL SELECT DATE '2017-03-01' AS ref_month, 22 AS value, 2 AS id
UNION ALL SELECT DATE '2017-04-01' AS ref_month, 28 AS value, 2 AS id
-- UNION ALL SELECT DATE '2017-05-01' as ref_month, 30 as value, 2 as id
-- UNION ALL SELECT DATE '2017-06-01' as ref_month, 37 as value, 2 as id
UNION ALL SELECT DATE '2017-07-01' AS ref_month, 42 AS value, 2 AS id
UNION ALL SELECT DATE '2017-08-01' AS ref_month, 55 AS value, 2 AS id
UNION ALL SELECT DATE '2017-09-01' AS ref_month, 49 AS value, 2 AS id
-- UNION ALL SELECT DATE '2017-10-01' as ref_month, 51 as value, 2 as id
UNION ALL SELECT DATE '2017-11-01' AS ref_month, 57 AS value, 2 AS id
UNION ALL SELECT DATE '2017-12-01' AS ref_month, 56 AS value, 2 AS id
)
SELECT
id,
ref_month,
value,
moving_sum,
first_month,
last_month,
ROUND(moving_sum / (last_month - first_month + 1)) AS correct_moving_avg,
moving_avg
FROM (
SELECT
id,
ref_month,
value,
SUM(value) OVER (rolling_six_days) AS moving_sum,
FIRST_VALUE(month_pos) OVER (rolling_six_days) AS first_month,
LAST_VALUE(month_pos) OVER (rolling_six_days) AS last_month,
AVG(value) OVER (rolling_six_days) AS moving_avg
FROM (
SELECT
id, ref_month, value,
DATE_DIFF(ref_month, '2016-01-01', MONTH) month_pos
FROM dummy_data
)
WINDOW rolling_six_days AS
(PARTITION BY id ORDER BY month_pos RANGE BETWEEN 5 PRECEDING AND CURRENT ROW )
)
ORDER BY 1,2
结果为
id ref_month value moving_sum first_month last_month correct_moving_avg moving_avg
1 2017-01-01 18 18 12 12 18.0 18.0
1 2017-02-01 20 38 12 13 19.0 19.0
1 2017-03-01 22 60 12 14 20.0 20.0
1 2017-05-01 30 90 12 16 18.0 22.5
1 2017-06-01 37 127 12 17 21.0 25.4
1 2017-07-01 42 151 13 18 25.0 30.2
1 2017-10-01 51 160 16 21 27.0 40.0
1 2017-11-01 57 187 17 22 31.0 46.75
1 2017-12-01 56 206 18 23 34.0 51.5
2 2017-01-01 18 18 12 12 18.0 18.0
2 2017-02-01 20 38 12 13 19.0 19.0
2 2017-03-01 22 60 12 14 20.0 20.0
2 2017-04-01 28 88 12 15 22.0 22.0
2 2017-07-01 42 112 13 18 19.0 28.0
2 2017-08-01 55 147 14 19 25.0 36.75
2 2017-09-01 49 174 15 20 29.0 43.5
2 2017-11-01 57 203 18 22 41.0 50.75
2 2017-12-01 56 259 18 23 43.0 51.8
希望这向您显示/说明了方法
hope this shows/explains you the approach
注意:如果您的ref_month
字段为STRING`数据类型,则应使用DATE_DIFF稍微调整行-应该为
Note: if your ref_month
field is of STRING` data tyoe you should slightly adjust line with DATE_DIFF - it should be as
DATE_DIFF(cast(ref_month as DATE), '2016-01-01', MONTH) month_pos
注2:我选择"2016-01-01"作为计算月份的起点-但您可以选择任意一个以确保它小于您的最小日期-例如"2000-01-01"将也很完美
Note 2: I picked '2016-01-01' as a starting point for counting months - but you can pick any to make sure that it is less than your minimum date - for example '2000-01-01' will perfectly work too
这篇关于具有缺失值的BIGQUERY移动平均线的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!