HIVE-基于日期的带有窗口的分区计算统计信息 [英] HIVE - compute statistics over partitions with window based on date
问题描述
我见过类似我的问题的解决方案,但没有一个对我有用.我也相信应该有一种方法可以使它工作.
I've seen solutions for problems similar to mine, but none quite work for me. Also I'm confident that there should be a way to make it work.
通过以下方式提供表格
ID | 日期 | 目标 |
---|---|---|
1 | 2020-01-01 | 1 |
1 | 2020-01-02 | 1 |
1 | 2020-01-03 | 0 |
1 | 2020-01-04 | 1 |
1 | 2020-01-04 | 0 |
1 | 2020-06-01 | 1 |
1 | 2020-06-02 | 1 |
1 | 2020-06-03 | 0 |
1 | 2020-06-04 | 1 |
1 | 2020-06-04 | 0 |
2 | 2020-01-01 | 1 |
ID为BIGINT,目标为Int,日期为DATE
ID is BIGINT, target is Int and Date is DATE
我想为每个ID/日期计算日期(包括该日期)之前的3个月和12个月中同一ID的总和和行数.输出示例:
I want to compute, for each ID/Date, the sum and the number of rows for the same ID in the 3 months and 12 months before the Date (inclusive). Example of output:
ID | 日期 | Sum_3 | Count_3 | Sum_12 | Count_12 |
---|---|---|---|---|---|
1 | 2020-01-01 | 1 | 1 | 1 | 1 |
1 | 2020-01-02 | 2 | 2 | 2 | 2 |
1 | 2020-01-03 | 2 | 3 | 2 | 3 |
1 | 2020-01-04 | 3 | 5 | 3 | 5 |
1 | 2020-06-01 | 1 | 1 | 4 | 6 |
1 | 2020-06-02 | 2 | 2 | 5 | 7 |
1 | 2020-06-03 | 2 | 3 | 6 | 8 |
1 | 2020-06-04 | 3 | 5 | 7 | 10 |
2 | 2020-01-01 | 1 | 1 | 1 | 1 |
我这次如何获得HIVE结果?我不确定是否应该使用解析函数(以及如何使用),分组依据等?
How can I get this time of results in HIVE? I'm not sure if I should use analytical functions (and how), group by, etc...?
推荐答案
如果您可以估计间隔时间(1个月= 30天):(改进了GMB的答案)
If you can do an estimation of interval (1 month = 30 days): (an improvement of GMB's answer)
with t as (
select ID, Date,
sum(target) target,
count(target) c_target
from table
group by ID, Date
)
select ID, Date,
sum(target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 90 preceding
) sum_3,
sum(c_target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 90 preceding
) count_3,
sum(target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 360 preceding
) sum_12,
sum(c_target) over(
partition by ID
order by unix_timestamp(Date, 'yyyy-MM-dd')
range 60 * 60 * 24 * 360 preceding
) count_12
from t
或者,如果您想要精确的间隔,则可以进行自我联接(但价格昂贵):
Or if you want exact intervals, you can do self joins (but expensive):
with t as (
select ID, Date,
sum(target) target,
count(target) c_target
from table
group by ID, Date
)
select
t_3month.ID,
t_3month.Date,
t_3month.sum_3,
t_3month.count_3,
sum(t3.target) sum_12,
sum(t3.c_target) count_12
from (
select
t1.ID,
t1.Date,
sum(t2.target) sum_3,
sum(t2.c_target) count_3
from t t1
left join t t2
on t2.Date > t1.Date - interval 3 month and
t2.Date <= t1.Date and
t1.ID = t2.ID
group by t1.ID, t1.Date
) t_3month
left join t t3
on t3.Date > t_3month.Date - interval 12 month and
t3.Date <= t_3month.Date and
t_3month.ID = t3.ID
group by t_3month.ID, t_3month.Date, t_3month.sum_3, t_3month.count_3
order by ID, Date;
这篇关于HIVE-基于日期的带有窗口的分区计算统计信息的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!