Hive: Best way to do incremental updates on a main table


Question



I have a main table in Hive that will store all my data.

I want to be able to load an incremental data update about every month. Each update is large, a couple of billion rows, and contains new data as well as updated entries.

What is the best way to approach this? I know Hive was recently upgraded and now supports update/insert/delete.

What I've been thinking is to somehow find the entries that will be updated and remove them from the main table and then just insert the new incremental update. However after trying this, the inserts are very fast, but the deletes are very slow.

The other way is to use the update statement to match the key values between the main table and the incremental update, and update the matching fields. I haven't tried this yet. It also sounds painfully slow, since Hive would have to update each entry one by one.

Does anyone have ideas on how to do this efficiently and effectively? I'm pretty new to Hive and databases in general.

Solution

If you cannot update in ACID mode using MERGE, it's possible to update using a FULL OUTER JOIN instead. To find all entries that will be updated, you need to join the increment data with the old data:

INSERT OVERWRITE TABLE target_data [partition() if applicable]
SELECT
  -- take the new row if it exists, otherwise keep the old one
  CASE WHEN i.PK IS NOT NULL THEN i.PK    ELSE t.PK    END AS PK,
  CASE WHEN i.PK IS NOT NULL THEN i.COL1  ELSE t.COL1  END AS COL1,
  ...
  CASE WHEN i.PK IS NOT NULL THEN i.COL_n ELSE t.COL_n END AS COL_n
FROM
  target_data t -- restrict partitions if applicable
  FULL JOIN increment_data i ON (t.PK = i.PK);

It's possible to optimize this by restricting the partitions of target_data that are joined and overwritten, so that partitions the increment does not touch are neither read nor rewritten.
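
As a rough illustration of the partition restriction, here is a sketch under some assumptions that are not in the original answer: target_data is partitioned by a hypothetical column load_month, increment_data carries the same column, a row never moves to a different partition, and the data columns are just pk, col1 and col2. The partition values in the filter are made up.

```sql
-- Hypothetical schema: target_data(pk, col1, col2) partitioned by load_month;
-- increment_data(pk, col1, col2, load_month).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE target_data PARTITION (load_month)
SELECT
  CASE WHEN i.pk IS NOT NULL THEN i.pk         ELSE t.pk         END AS pk,
  CASE WHEN i.pk IS NOT NULL THEN i.col1       ELSE t.col1       END AS col1,
  CASE WHEN i.pk IS NOT NULL THEN i.col2       ELSE t.col2       END AS col2,
  -- the dynamic partition column must come last in the select list
  CASE WHEN i.pk IS NOT NULL THEN i.load_month ELSE t.load_month END AS load_month
FROM (
  -- Read only the partitions the increment touches. This list must cover
  -- EVERY load_month present in increment_data; a partition missing here
  -- would be overwritten with the increment rows only.
  SELECT pk, col1, col2, load_month
  FROM target_data
  WHERE load_month IN ('2017-05', '2017-06')
) t
FULL JOIN increment_data i
  ON t.pk = i.pk AND t.load_month = i.load_month;
```

With dynamic partitioning, only the partitions that appear in the query result are rewritten, so the rest of the table stays untouched.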

Also if you want to update all columns with new data, you can apply this solution with UNION ALL+row_number(): https://stackoverflow.com/a/44755825/2700344
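
A minimal sketch of the UNION ALL + row_number() idea, not copied from the linked answer: it assumes an unpartitioned target_data with hypothetical columns pk, col1 and col2, and that pk is unique within each of the two tables. The is_new flag is a helper introduced only for this example.

```sql
-- Stack old and new rows, then keep one row per key, preferring the increment.
INSERT OVERWRITE TABLE target_data
SELECT pk, col1, col2
FROM (
  SELECT
    pk, col1, col2,
    -- rows from the increment (is_new = 1) sort first within each pk
    ROW_NUMBER() OVER (PARTITION BY pk ORDER BY is_new DESC) AS rn
  FROM (
    SELECT pk, col1, col2, 0 AS is_new FROM target_data
    UNION ALL
    SELECT pk, col1, col2, 1 AS is_new FROM increment_data
  ) u
) ranked
WHERE rn = 1;
```

If the increment itself can contain several versions of the same key, the ORDER BY would also need something like a timestamp column to pick the latest one; otherwise the choice among them is arbitrary.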

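For the case where MERGE is available, which needs a transactional (ACID) target table and Hive 2.2 or later, the same upsert could look roughly like this. The column names pk, col1 and col2 are hypothetical, and the INSERT assumes increment_data has its columns in the same order as target_data.

```sql
-- Sketch only: target_data must be a transactional (ACID) table.
MERGE INTO target_data AS t
USING increment_data AS i
ON t.pk = i.pk
-- key already exists: overwrite its fields with the new values
WHEN MATCHED THEN
  UPDATE SET col1 = i.col1, col2 = i.col2
-- key is new: add the row
WHEN NOT MATCHED THEN
  INSERT VALUES (i.pk, i.col1, i.col2);
```

This runs as a single statement rather than a separate delete followed by an insert.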