HIVE如何根据某些条件更新现有数据(如果存在)以及如何插入新数据(如果不存在) [英] HIVE how to update the existing data if it exists based on some condition and insert new data if not exists
问题描述
如果要基于某些条件存在,我想更新现有数据(应更新优先级更高的数据),如果不存在,请插入新数据.
I want to update the existing data if it exists based on some condition(data with higher priority should be updated) and insert new data if not exists.
我已经为此写了一个查询,但是不知何故它在重复行数.这是我所拥有的以及我要实现的目标的完整说明:
I have already written a query for this but somehow it is duplicating the number of rows. Here is the full explanation of what I have and what I want to achieve:
我所拥有的: 表1-列-id,info,priority
What I have: Table 1 - columns - id,info,priority
hive> select * from sample1;
OK
1 123 1.01
2 234 1.02
3 213 1.03
5 213423 1.32
Time taken: 1.217 seconds, Fetched: 4 row(s)
表2:列-id,info,priority
Table 2: columns - id,info,priority
hive> select * from sample2;
OK
1 1234 1.05
2 23412 1.01
3 21 1.05
4 1232 1.1
2 3432423 1.6
3 34324 1.4
我想要的是决赛桌每个ID应该只有1行,并且根据最高优先级显示数据:
What I want is the final table should have only 1 row per id with the data according to the greatest priority:
1 1234 1.05
2 3432423 1.6
3 34324 1.4
4 1232 1.1
5 213423 1.32
我写的查询是这样的:
insert overwrite table sample1
select a.id,
case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.info else b.info end as info,
case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.prio else b.prio end as prio
from sample1 a
join
sample2 b
on a.id=b.id where b.id in (select distinct(id) from sample1)
union all
select * from sample2 where id not in (select distinct(id) from sample1)
union all
select * from sample1 where id not in (select distinct(id) from sample2);
运行此查询后,我得到以下结果:
After running this query, I am getting this result:
hive> select * from sample1;
OK
1 1234 1.05
2 234 1.02
3 21 1.05
2 3432423 1.6
3 34324 1.4
5 213423 1.32
4 1232 1.1
如何修改当前查询以获得正确的结果.我是否可以遵循其他任何方法/过程来达到最终结果.我正在将hadoop 2.5.2和HIVE 1.2.1一起使用.我正在研究一个具有5个从属和1个NN的6节点集群.
How do I modify the present query to achieve the correct result. Is there any other method/process that I can follow to achieve the end result. I am using hadoop 2.5.2 along with HIVE 1.2.1 . I am working on a 6 node cluster with 5 slaves and 1 NN.
推荐答案
添加到以前的好答案中! 也可以尝试以下方法:
adding to previously good answers! try this also:
insert overwrite table UDB.SAMPLE1
select
COALESCE(id2,id )
,COALESCE(info2,info)
,COALESCE(priority2, priority)
from
UDB.SAMPLE1 TAB1
full outer JOIN
(
select id2, info2, priority2
from
(
select
id as id2
,info as info2
,priority as priority2
,row_number() over (partition by id order by priority desc) rn
from UDB.SAMPLE2
)TAB2_wt
where TAB2_wt.rn =1
)TAB2
on TAB2.id2 = TAB1.id
;
select * from SAMPLE1;
+-----+----------+-----------+--+
| id | info | priority |
+-----+----------+-----------+--+
| 1 | 1234 | 1.05 |
| 2 | 3432423 | 1.6 |
| 3 | 34324 | 1.4 |
| 4 | 1232 | 1.1 |
| 5 | 213423 | 1.32 |
+-----+----------+-----------+--+
这篇关于HIVE如何根据某些条件更新现有数据(如果存在)以及如何插入新数据(如果不存在)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!