从另一个配置单元表中刷新一个配置单元表 [英] Refresh one hive table from another hive table

查看:124
本文介绍了从另一个配置单元表中刷新一个配置单元表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有几个Hive表,我使用每小时增量导入的Sqoop从RDBMS引入,并将它们分级。我加入这些表格并创建新的维度表。每当我将RDBMS中的新行引入到Hive登台表中时,我都必须刷新维表。如果没有新行,则不应该执行暗表的刷新。我正在使用的配置单元版本没有ACID功能。



需要一些关于如何在配置单元中实现此功能的建议。

解决方案

您可以将新数据插入现有的Hive表中,就像其他数据库一样。 Hive还支持WHERE NOT EXISTS子句。

  INSERT INTO TABLE MyDim 
SELECT Id,Blah1,Blah2
FROM MySource s
WHERE NOT EXISTS
(SELECT 1 FROM MyDim z WHERE z.Id = s.Id)

但有一个问题:每个INSERT都会创建一个新的HDFS文件,即使涉及零记录。

每周的压缩作业将会很有帮助(例如,重命名碎片表,重新创建表,重命名后的INSERT OVERWRITE表,删除重命名)


I have a few Hive tables that i am bringing in from RDBMS using Sqoop incremental imports every hour and staging them. I am joining these tables and creating new dimension tables. Whenever i bring in new rows from RDBMS into Hive staging tables, I have to refresh the dimension tables. If there are no new rows, the refresh of dim tables should not be done. The hive version I'm using does not have ACID features.

Need some advice on how this could be achieved in hive.

解决方案

You can INSERT new data in existing Hive tables, like any other database. And Hive also supports the WHERE NOT EXISTS clause.

INSERT INTO TABLE MyDim
SELECT Id, Blah1, Blah2
FROM MySource s
WHERE NOT EXISTS
 (SELECT 1 FROM MyDim z WHERE z.Id =s.Id)

But there is a catch: each INSERT will create a new HDFS file, even when there are zero records involved. Too much fragmentation will reduce performance over time.

A weekly "compaction" job would be helpful (e.g. rename the fragmented table, re-CREATE the table, INSERT OVERWRITE from renamed table, drop renamed)

这篇关于从另一个配置单元表中刷新一个配置单元表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆