Hive:在主表上进行增量更新的最佳方式 [英] Hive: Best way to do incremetal updates on a main table

查看:83
本文介绍了Hive:在主表上进行增量更新的最佳方式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以我在 Hive 中有一个主表,它将存储我所有的数据.

So I have a main table in Hive, it will store all my data.

我希望能够加载大约每个月的增量数据更新拥有大量数据,数十亿行.会有新数据以及更新的条目.

I want to be able to load a incremental data update about every month with a large amount of data couple billion rows. There will be new data as well as updated entries.

解决这个问题的最佳方法是什么,我知道 Hive 最近升级并支持更新/插入/删除.

What is the best way to approach this, I know Hive recently upgrade and supports update/insert/delete.

我一直在想的是以某种方式找到将要更新的条目并将它们从主表中删除,然后插入新的增量更新.但是在尝试之后,插入非常快,但删除非常慢.

What I've been thinking is to somehow find the entries that will be updated and remove them from the main table and then just insert the new incremental update. However after trying this, the inserts are very fast, but the deletes are very slow.

另一种方式是使用update语句做一些事情来匹配来自主表的键值和增量更新并更新它们的字段.我还没有试过这个.这听起来也很慢,因为 Hive 必须逐一更新每个条目.

The other way is to do something using the update statement to match the key values from the main table and the incremental update and update their fields. I haven't tried this yet. This also sounds painfully slow since Hive would have to update each entry 1 by 1.

有人对如何最有效地做到这一点有任何想法吗??总的来说,我对 Hive 和数据库很陌生.

Anyone got any ideas as to how to do this most efficiently and effectively ?? I'm pretty new to Hive and databases in general.

推荐答案

If 在 ACID 模式下合并 不适用,然后可以使用 FULL OUTER JOIN 或使用 UNION ALL + row_number 进行更新.要查找将要更新的所有条目,您可以将增量数据与旧数据合并:

If merge in ACID mode is not applicable, then it's possible to update using FULL OUTER JOIN or using UNION ALL + row_number. To find all entries that will be updated you can join increment data with old data:

insert overwrite target_data [partition() if applicable]
SELECT
  --select new if exists, old if not exists
  case when i.PK is not null then i.PK   else t.PK   end as PK,
  case when i.PK is not null then i.COL1 else t.COL1 end as COL1,
  ... 
  case when i.PK is not null then i.COL_n else t.COL_n end as COL_n
  FROM 
      target_data t --restrict partitions if applicable
      FULL JOIN increment_data i on (t.PK=i.PK); 

可以通过限制 target_data 中的分区来优化这一点,这些分区将使用 WHERE partition_col in (select distinct partition_col from increment_data) 或将分区列表作为参数传递并在where 子句,它会工作得更快.

It's possible to optimize this by restricting partitions in target_data that will be overwritten and joined using WHERE partition_col in (select distinct partition_col from increment_data) or pass partition list if possible as a parameter and use in the where clause, it will work even faster.

此外,如果您想用新数据更新所有列,您可以使用 UNION ALL+row_number() 应用此解决方案,它比完全连接更快:https://stackoverflow.com/a/44755825/2700344

Also if you want to update all columns with new data, you can apply this solution with UNION ALL+row_number(), it works faster than full join: https://stackoverflow.com/a/44755825/2700344

这篇关于Hive:在主表上进行增量更新的最佳方式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆