HIVE如何根据某些条件更新现有数据(如果存在)以及如何插入新数据(如果不存在) [英] HIVE how to update the existing data if it exists based on some condition and insert new data if not exists

查看:899
本文介绍了HIVE如何根据某些条件更新现有数据(如果存在)以及如何插入新数据(如果不存在)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如果要基于某些条件存在,我想更新现有数据(应更新优先级更高的数据),如果不存在,请插入新数据.

I want to update the existing data if it exists based on some condition(data with higher priority should be updated) and insert new data if not exists.

我已经为此写了一个查询,但是不知何故它在重复行数.这是我所拥有的以及我要实现的目标的完整说明:

I have already written a query for this but somehow it is duplicating the number of rows. Here is the full explanation of what I have and what I want to achieve:

我所拥有的: 表1-列-id,info,priority

What I have: Table 1 - columns - id,info,priority

hive> select * from sample1;
OK
1   123     1.01
2   234     1.02
3   213     1.03
5   213423  1.32
Time taken: 1.217 seconds, Fetched: 4 row(s)

表2:列-id,info,priority

Table 2: columns - id,info,priority

hive> select * from sample2;
OK
1   1234    1.05
2   23412   1.01
3   21      1.05
4   1232    1.1
2   3432423 1.6
3   34324   1.4

我想要的是决赛桌每个ID应该只有1行,并且根据最高优先级显示数据:

What I want is the final table should have only 1 row per id with the data according to the greatest priority:

1   1234    1.05
2   3432423 1.6
3   34324   1.4
4   1232    1.1
5   213423  1.32

我写的查询是这样的:

insert overwrite table sample1
select a.id,
case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.info else b.info end as info,
case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.prio else b.prio end as prio
from sample1 a
join 
sample2 b
on a.id=b.id where b.id in (select distinct(id) from sample1)
union all
select * from sample2 where id not in (select distinct(id) from sample1)
union all
select * from sample1 where id not in (select distinct(id) from sample2);

运行此查询后,我得到以下结果:

After running this query, I am getting this result:

hive> select * from sample1;
OK
1   1234    1.05
2   234     1.02
3   21      1.05
2   3432423 1.6
3   34324   1.4
5   213423  1.32
4   1232    1.1

如何修改当前查询以获得正确的结果.我是否可以遵循其他任何方法/过程来达到最终结果.我正在将hadoop 2.5.2和HIVE 1.2.1一起使用.我正在研究一个具有5个从属和1个NN的6节点集群.

How do I modify the present query to achieve the correct result. Is there any other method/process that I can follow to achieve the end result. I am using hadoop 2.5.2 along with HIVE 1.2.1 . I am working on a 6 node cluster with 5 slaves and 1 NN.

推荐答案

添加到以前的好答案中! 也可以尝试以下方法:

adding to previously good answers! try this also:

insert overwrite table UDB.SAMPLE1
select 
 COALESCE(id2,id )
,COALESCE(info2,info)
,COALESCE(priority2, priority)
from 
UDB.SAMPLE1 TAB1
full outer JOIN
(
select id2, info2, priority2
from
(
select 
 id       as id2
,info     as info2
,priority as priority2
,row_number() over (partition by id order by priority desc) rn
from UDB.SAMPLE2
)TAB2_wt
where TAB2_wt.rn =1
)TAB2
on TAB2.id2 = TAB1.id
;

select * from SAMPLE1;

+-----+----------+-----------+--+
| id  |   info   | priority  |
+-----+----------+-----------+--+
| 1   | 1234     | 1.05      |
| 2   | 3432423  | 1.6       |
| 3   | 34324    | 1.4       |
| 4   | 1232     | 1.1       |
| 5   | 213423   | 1.32      |
+-----+----------+-----------+--+

这篇关于HIVE如何根据某些条件更新现有数据(如果存在)以及如何插入新数据(如果不存在)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆