根据配置单元表中的其他表值更新列值 [英] Updating column values based on the other table values in hive tables
问题描述
我在蜂巢中有两个下面的表格
I have two tables like below in hive
stg
.该表基本上是快照表,每天都会被覆盖
stg
. This table is bascially snapshot table which will be overwritten everyday
该表数据每天都会在新的 partition
This table data will be inserted to history
table every day in new partition
第1天
stg
表
+-----+------------+------------+
| pk | from_d | to_d |
+-----+------------+------------+
| 111 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+
| 222 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+
| 333 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+
历史记录
该表按列 load_date
+-----+------------+------------+------------+
| pk | from_d | to_d |load_date |
+-----+------------+------------+------------+
| 111 | 2019-01-01 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 222 | 2019-01-01 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 333 | 2019-01-01 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+------------+
问题陈述:
1)如果我收到 history
表中已经存在的任何 PK
,则需要更新该PK的 to_d
列在历史上.
1) If I receieve any PK
that is already present in history
table then I will need to update the to_d
column for that PK in history.
2)在 STG
表中, to_d
列的值应为 from_d-1天
.
2) The to_d
column should have value of from_d - 1 day
value in the STG
table.
3)还需要考虑如果 PK
在另一天再次出现,那么更新应该只针对历史中的最新记录进行,而不是针对同一PK的所有记录.
3) Also need to consider if the PK
is again coming in another day then the update should happen to only the latest record in history not all the records for the same PK.
请在以下数据示例中检查PK 111
.
Please check PK 111
in the below data examples.
第2天
stg
+-----+------------+------------+
| pk | from_d | to_d |
+-----+------------+------------+
| 111 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+
| 333 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+
| 444 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+
历史记录
表将按如下所示更新
+-----+------------+------------+------------+
| pk | from_d | to_d | load_date |
+-----+------------+------------+------------+
| 111 | 2019-01-01 | 2019-02-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 222 | 2019-01-01 | 2019-02-02 | 2019-01-01 |
+-----+------------+------------+------------+
| 333 | 2019-01-01 | 2019-02-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 111 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 333 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 444 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
要实现上述目标,我首先使用
To achieve the above first I have updated the history table using
insert overwrite table histroy partition(load_date)
select pk, from_d,
case when pk = '111' then '2019-02-01' when pk = '333' then '2019-02-01' else to_d end as to_d,
load_date
from history;
完成后,我将第2天的stg表插入历史记录表
once this is done then I have inserted the day 2 stg table to history table
第3天
stg
+-----+------------+------------+
| pk | from_d | to_d |
+-----+------------+------------+
| 111 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+
| 222 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+
| 555 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+
历史
如下所示
+-----+------------+------------+------------+
| pk | from_d | to_d | load_date |
+-----+------------+------------+------------+
| 111 | 2019-01-01 | 2019-02-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 222 | 2019-01-01 | 2019-03-02 | 2019-01-01 |
+-----+------------+------------+------------+
| 333 | 2019-01-01 | 2019-02-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 111 | 2019-02-02 | 2019-03-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 333 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 444 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 111 | 2019-03-03 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+------------+
| 222 | 2019-03-03 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+------------+
| 555 | 2019-03-03 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+------------+
要达到上述目的,我已经完成
To achieve the above I have done
insert overwrite table histroy partition(load_date)
select pk, from_d,
case when pk = '111' then '2019-03-02' else to_d end as to_d,
load_date
from history
where load_date = '2019-02-02';
insert overwrite table history partition(load_date)
select pk, from_d,
case when pk = '222' then '2019-03-02' else to_d end as to_d,
load_date
from history
where load_date = '2019-01-01';
然后插入 STG
表数据
我正在实现自己想要的目标,但这是一个繁琐的过程,必须有比这更好的方法.
I am achieving what I want but this is a tedious process and there must be better approaches than this one.
注意:对于此问题,我不想使用 Update
语句.插入覆盖是我正在寻找的
Note: I don't want to use the Update
statements for this problem. Insert overwrite is what I am looking for
推荐答案
您可以在下面进行操作
首先创建一个表,并为如下由 PK
分区的每一行分配row_number
First create a table and assign row_number for each row partitioned by PK
like below
create table stg_row_num as select *,
row_number() over ( partition by pk order by load_date desc) as row_num from stg;
上面的查询应为您提供如下表
The above query should give you table like below
+---+----------+----------+----------+--------+
| pk| from_d| to_d| load_date| row_num|
+---+----------+----------+----------+--------+
|111|2019-03-03|2019-03-03|2019-03-03| 1|
|111|2019-02-02|2019-02-02|2019-02-02| 2|
|111|2019-01-01|2019-01-01|2019-01-01| 3|
|222|2019-03-03|2019-03-03|2019-03-03| 1|
|222|2019-01-01|2019-01-01|2019-01-01| 2|
|333|2019-02-02|2019-02-02|2019-02-02| 1|
|333|2019-01-01|2019-01-01|2019-01-01| 2|
|444|2019-02-02|2019-02-02|2019-02-02| 1|
|555|2019-03-03|2019-03-03|2019-03-03| 1|
+---+----------+----------+----------+--------+
一旦有了上表,然后使用如下所示的 LAG
函数
Once you have the above table then using LAG
function like below
select pk, from_d,
case when row_num = 1 then to_d else date_sub(lag(to_d) over (), 1) end as to_d,
row_num from table;
这将为您提供理想的结果
This will give you the desired result
+---+----------+----------+-------------------+
| pk| from_d| to_d|row_number_window_0|
+---+----------+----------+-------------------+
|111|2019-03-03|2019-03-03| 1|
|111|2019-02-02|2019-03-02| 2|
|111|2019-01-01|2019-02-01| 3|
|222|2019-03-03|2019-03-03| 1|
|222|2019-01-01|2019-03-02| 2|
|333|2019-02-02|2019-02-02| 1|
|333|2019-01-01|2019-02-01| 2|
|444|2019-02-02|2019-02-02| 1|
|555|2019-03-03|2019-03-03| 1|
+---+----------+----------+-------------------+
希望这对您有帮助
这篇关于根据配置单元表中的其他表值更新列值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!