HIVE-基于日期的带有窗口的分区计算统计信息 [英] HIVE - compute statistics over partitions with window based on date

查看:95
本文介绍了HIVE-基于日期的带有窗口的分区计算统计信息的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我见过类似我的问题的解决方案,但没有一个对我有用.我也相信应该有一种方法可以使它工作.

I've seen solutions for problems similar to mine, but none quite work for me. Also I'm confident that there should be a way to make it work.

通过以下方式提供表格

<身体>
ID 日期目标
1 2020-01-01 1
1 2020-01-02 1
1 2020-01-03 0
1 2020-01-04 1
1 2020-01-04 0
1 2020-06-01 1
1 2020-06-02 1
1 2020-06-03 0
1 2020-06-04 1
1 2020-06-04 0
2 2020-01-01 1

ID为BIGINT,目标为Int,日期为DATE

ID is BIGINT, target is Int and Date is DATE

我想为每个ID/日期计算日期(包括该日期)之前的3个月和12个月中同一ID的总和和行数.输出示例:

I want to compute, for each ID/Date, the sum and the number of rows for the same ID in the 3 months and 12 months before the Date (inclusive). Example of output:

<身体>
ID 日期 Sum_3 Count_3 Sum_12 Count_12
1 2020-01-01 1 1 1 1
1 2020-01-02 2 2 2 2
1 2020-01-03 2 3 2 3
1 2020-01-04 3 5 3 5
1 2020-06-01 1 1 4 6
1 2020-06-02 2 2 5 7
1 2020-06-03 2 3 6 8
1 2020-06-04 3 5 7 10
2 2020-01-01 1 1 1 1

我这次如何获得HIVE结果?我不确定是否应该使用解析函数(以及如何使用),分组依据等?

How can I get this time of results in HIVE? I'm not sure if I should use analytical functions (and how), group by, etc...?

推荐答案

如果您可以估计间隔时间(1个月= 30天):(改进了GMB的答案)

If you can do an estimation of interval (1 month = 30 days): (an improvement of GMB's answer)

with t as (
    select ID, Date,
        sum(target) target,
        count(target) c_target
    from table
    group by ID, Date
)
select ID, Date,
    sum(target) over(
        partition by ID
        order by unix_timestamp(Date, 'yyyy-MM-dd')
        range 60 * 60 * 24 * 90 preceding
    ) sum_3,
    sum(c_target) over(
        partition by ID
        order by unix_timestamp(Date, 'yyyy-MM-dd')
        range 60 * 60 * 24 * 90 preceding
    ) count_3,
    sum(target) over(
        partition by ID
        order by unix_timestamp(Date, 'yyyy-MM-dd')
        range 60 * 60 * 24 * 360 preceding
    ) sum_12,
    sum(c_target) over(
        partition by ID
        order by unix_timestamp(Date, 'yyyy-MM-dd')
        range 60 * 60 * 24 * 360 preceding
    ) count_12
from t

或者,如果您想要精确的间隔,则可以进行自我联接(但价格昂贵):

Or if you want exact intervals, you can do self joins (but expensive):

with t as (
    select ID, Date,
        sum(target) target,
        count(target) c_target
    from table
    group by ID, Date
)
select
    t_3month.ID, 
    t_3month.Date, 
    t_3month.sum_3, 
    t_3month.count_3, 
    sum(t3.target) sum_12, 
    sum(t3.c_target) count_12
from (
    select 
        t1.ID, 
        t1.Date,
        sum(t2.target) sum_3,
        sum(t2.c_target) count_3
    from t t1
    left join t t2
    on t2.Date > t1.Date - interval 3 month and
       t2.Date <= t1.Date and
       t1.ID = t2.ID
    group by t1.ID, t1.Date
) t_3month
left join t t3
on t3.Date > t_3month.Date - interval 12 month and
   t3.Date <= t_3month.Date and
   t_3month.ID = t3.ID
group by t_3month.ID, t_3month.Date, t_3month.sum_3, t_3month.count_3
order by ID, Date;

这篇关于HIVE-基于日期的带有窗口的分区计算统计信息的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆