Hive SQL query to fill missing date values in table with nearest values between date range


Problem description

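For the example below, using the same data, how would I modify the Hive SQL query so that the Mary and Peter accounts cover the same date range? For example, set the date range to '2021-05-24' through '2021-06-03' and fill in all balances for that period. Taking Mary as an example, I would also like Mary's available balance '53028.1' to be forward-filled up to '2021-06-03' and, if Mary had no value for '2021-05-24', to have that date backfilled with the nearest available balance.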

with mytable as (--Demo dataset, use your table instead of this
select stack(10, --number of tuples
'Peter',float(50000),'2021-05-24',
'Peter',float(50035),'2021-05-25',
'Peter',float(50035),'2021-05-26',
'Peter',float(50610),'2021-05-28',
'Peter',float(51710),'2021-06-01',
'Peter',float(53028.1),'2021-06-02',
'Peter',float(53916.1),'2021-06-03',
'Mary',float(50000),'2021-05-24',
'Mary',float(50035),'2021-05-25',
'Mary',float(53028.1),'2021-05-30'
) as (account_name,available_balance,Date_of_balance)
) --use your table instead of this CTE

select  account_name, available_balance, date_add(Date_of_balance,e.i) as Date_of_balance
from
( --Get next_date to generate date range
select account_name,available_balance,Date_of_balance,
       lead(Date_of_balance,1, Date_of_balance) over (partition by account_name order by Date_of_balance) next_date    
  from mytable d  --use your table
) s lateral view outer posexplode(split(space(datediff(next_date,Date_of_balance)-1),'')) e as i,x --generate rows
order by account_name desc, Date_of_balance --this is to have order of rows like in your Converted Table

Result:

account_name    available_balance   date_of_balance 
Peter           50000                2021-05-24
Peter           50035                2021-05-25
Peter           50035                2021-05-26
Peter           50035                2021-05-27
Peter           50610                2021-05-28
Peter           50610                2021-05-29
Peter           50610                2021-05-30
Peter           50610                2021-05-31
Peter           51710                2021-06-01
Peter           53028.1              2021-06-02
Peter           53916.1              2021-06-03
Mary            50000                2021-05-24
Mary            50035                2021-05-25
Mary            50035                2021-05-26
Mary            50035                2021-05-27
Mary            50035                2021-05-28
Mary            50035                2021-05-29
Mary            53028.1              2021-05-30
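
The row generation in the query above rests on three Hive built-ins: space(n) returns a string of n spaces, split turns that string into an array, and posexplode emits one row per array element together with its position i. The following is a minimal sketch of that idiom in isolation. The literal dates are taken from Peter's 2021-05-28 and 2021-06-01 records, while the aliases t, start_date, and generated_date are assumptions made for illustration.

```sql
-- Expand one balance row into one row per day up to (but excluding) the next recorded date.
select date_add(t.start_date, e.i) as generated_date
from (select '2021-05-28' as start_date, '2021-06-01' as next_date) t
lateral view posexplode(
  split(space(datediff(t.next_date, t.start_date) - 1), '')  -- datediff is 4, so 3 spaces
) e as i, x;  -- i is the day offset, x is the unused array element
```

On the setup that produced the result above, this should return the four dates 2021-05-28 through 2021-05-31, which are the rows carrying Peter's 50610 balance. The full query uses lateral view outer, so a source row is kept even if the generator yields nothing for it.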

Note: Left Join helped me get this far in an earlier, linked question.

@Left Join

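I have a very large table and need the daily balance for each of the last 90 days. There are more than 1 million accounts, the balances table is huge, and a balance record is written only when an account's balance changes. Some accounts may have gone more than a year without a new balance record, so the code suggested by Left Join will not really work.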

I have two tables:

Accounts lookup table:

account_name, observation_date
'Peter','2021-05-24'
'Luis','2021-03-21'

Balances table:

account_name,account_balance,balance_date
'Peter',50000,'2020-03-20'
'Peter',50035,'2021-04-27'
'Peter',43821,'2021-05-21'
'Peter',50610,'2021-05-22'
'Mary',51710,'2019-03-20'
'Mary',53028.1,'2021-04-27'
'Mary',53916.1,'2021-05-21'
'Mary',54632.76,'2021-05-22'
'Roger',55147.76,'2021-03-03'
'Roger',55293.96,'2021-02-03'
'Roger',57142.15,'2021-03-04'
'Roger',67834.15,'2021-04-01'

The Hive SQL query I am looking for would join these two tables and produce a result like the one shown below:

account_name,account_balance,balance_date
Peter,50000,2020-03-20
Peter,50000,2021-02-24
Peter,50000,2021-02-25
Peter,…,…
Peter,50035,2021-04-27
Peter,50035,2021-04-28
Peter,50035,2021-04-29
Peter,…,…
Peter,43821,2021-05-21
Peter,50610,2021-05-22
Peter,43821,2021-05-23
Peter,43821,2021-05-24
Roger,55147.76,05/01/2021
Roger,55147.76,06/01/2021
Roger,55147.76,07/01/2021
Roger,…,…
Roger,55293.96,2021-02-03
Roger,57142.15,2021-02-04
Roger,57142.15,2021-02-05
Roger,…,…
Roger,67834.15,2021-04-01
Roger,67834.15,2021-04-02
Roger,67834.15,2021-04-03
Roger,67834.15,2021-04-04
Roger,67834.15,2021-04-05

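I know we could take all balances from the very beginning and then apply the lead function, but at this scale, with millions of records queried every day, that is not workable.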

Recommended answer

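Additionally, calculate the min and max dates over the whole dataset to determine the required date range, and the min date per account to check whether that account's start date needs fixing. Then add an extra calculation step for both dates: check whether the row sits on a boundary and, if the boundary is not the required one, assign the min or max value accordingly.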

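In this example Peter's start date is 2021-05-24 while Mary starts on 2021-05-23, so the range is extended and a 2021-05-23 record is generated for Peter. Mary's last date is 2021-05-30, so the missing rows up to the end of the range are generated for her.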
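
To make the boundary values described above easier to inspect, here is a minimal sketch that computes only those values on a cut-down dataset. The CTE name demo and its four rows (the first and last date of each account from this example) are assumptions made for illustration.

```sql
with demo as ( -- hypothetical cut-down dataset: first and last date per account
select stack(4,
'Peter','2021-05-24',
'Peter','2021-06-03',
'Mary','2021-05-23',
'Mary','2021-05-30'
) as (account_name,Date_of_balance)
)
select account_name, Date_of_balance,
       min(Date_of_balance) over() as min_date, --earliest date across all accounts
       max(Date_of_balance) over() as max_date, --latest date across all accounts
       min(Date_of_balance) over(partition by account_name) as min_date_account --earliest date per account
  from demo;
```

Every row gets min_date = 2021-05-23 and max_date = 2021-06-03, while min_date_account is 2021-05-24 for Peter and 2021-05-23 for Mary. Peter's first row is therefore the one that has to be moved back to the start of the range. The comparison works on plain strings because the dates are in yyyy-MM-dd form.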

with mytable as (--Demo dataset, use your table instead of this
select stack(10, --number of tuples
'Peter',float(50000),'2021-05-24',
'Peter',float(50035),'2021-05-25',
'Peter',float(50035),'2021-05-26',
'Peter',float(50610),'2021-05-28',
'Peter',float(51710),'2021-06-01',
'Peter',float(53028.1),'2021-06-02',
'Peter',float(53916.1),'2021-06-03', -------------end date greater than Mary
'Mary',float(50000),'2021-05-23', ----------------start date Less than Peter
'Mary',float(50035),'2021-05-25',
'Mary',float(53028.1),'2021-05-30'
) as (account_name,available_balance,Date_of_balance)
) --use your table instead of this CTE

select  account_name, available_balance, date_add(Date_of_balance,e.i) as Date_of_balance
from
(select account_name, available_balance, 
        --backfill: move the account's first date back to the overall min_date
        case when min_date < min_date_account and Date_of_balance = min_date_account then min_date 
             else Date_of_balance 
         end Date_of_balance,
        --forward fill: the account's last row (next_date is null) runs through max_date
        case when (next_date is null) and (Date_of_balance = max_date) then Date_of_balance 
             when (Date_of_balance < max_date) then nvl(next_date,date_add(max_date,1)) 
         end as next_date
from
( --Get next_date to generate date range
select account_name,available_balance,Date_of_balance,
       lead(Date_of_balance,1) over (partition by account_name order by Date_of_balance) next_date,
       max(Date_of_balance) over() max_date, --total min and max dates all accounts should align
       min(Date_of_balance) over() min_date, 
       min(Date_of_balance) over(partition by account_name) min_date_account
  from mytable d  --use your table
) s 
) s lateral view outer posexplode(split(space(datediff(next_date,Date_of_balance)-1),'')) e as i,x --generate rows
order by account_name desc, Date_of_balance --this is to have order of rows like in your Converted Table

Result:

account_name    available_balance   date_of_balance
Peter             50000              2021-05-23
Peter             50000              2021-05-24
Peter             50035              2021-05-25
Peter             50035              2021-05-26
Peter             50035              2021-05-27
Peter             50610              2021-05-28
Peter             50610              2021-05-29
Peter             50610              2021-05-30
Peter             50610              2021-05-31
Peter             51710              2021-06-01
Peter             53028.1            2021-06-02
Peter             53916.1            2021-06-03
Mary              50000              2021-05-23
Mary              50000              2021-05-24
Mary              50035              2021-05-25
Mary              50035              2021-05-26
Mary              50035              2021-05-27
Mary              50035              2021-05-28
Mary              50035              2021-05-29
Mary              53028.1            2021-05-30
Mary              53028.1            2021-05-31
Mary              53028.1            2021-06-01
Mary              53028.1            2021-06-02
Mary              53028.1            2021-06-03

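Note that the lead function is also calculated differently here: it has no default value, so a NULL next_date marks the last available date for the account.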
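
The difference between the two lead calls can be seen side by side in a small sketch. The CTE name demo and the output column names are assumptions made for illustration, and the two rows are Mary's last two records from the example.

```sql
with demo as ( -- hypothetical two-row dataset
select stack(2,
'Mary','2021-05-25',
'Mary','2021-05-30'
) as (account_name,Date_of_balance)
)
select account_name, Date_of_balance,
       --first query: the last row falls back to its own date
       lead(Date_of_balance,1, Date_of_balance) over (partition by account_name order by Date_of_balance) as next_date_with_default,
       --answer query: the last row gets NULL
       lead(Date_of_balance,1) over (partition by account_name order by Date_of_balance) as next_date_no_default
  from demo;
```

For the 2021-05-25 row both columns return 2021-05-30. For the 2021-05-30 row the first column returns 2021-05-30 and the second returns NULL, which is what lets the answer's case expression recognize an account's last record and extend it to max_date.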
