Delta/Incremental Load in Hive


Problem Description



I have the use case below:

My application has a table holding multiple years of data in an RDBMS. We used Sqoop to get the data into HDFS and loaded it into a Hive table partitioned by year and month.

Now, the application updates and inserts new records into the RDBMS table daily as well. These updated records can span historical months. Updated and newly inserted records can be identified by an updated-timestamp field (it will carry the current day's timestamp).

Now the problem is: how do we do a delta/incremental load of the Hive table daily using these updated records?

-> I know Sqoop has functionality that allows incremental imports. But importing only the new rows is not enough for us.

Because -

-> I cannot directly insert these records (using INSERT INTO) into the Hive table, because that would produce duplicate records (for the updated rows).

-> In the same way, I cannot use an INSERT OVERWRITE statement, as these update and insert records span multiple months. INSERT OVERWRITE would delete the earlier records.

Of course, the easiest option is to re-import the full data with Sqoop daily, but we don't want to do that, as the data volume is large.

So, basically, we want to fully reload only those partitions for which we have received update/insert records.

We are open to exploring options at the Hive or Sqoop end. Can you please advise?

Thanks in advance.

Solution

Updates are a notoriously difficult problem for any Hive-based system.

One typical approach is a two-step process:

  1. Insert any data that has changed into one table. As you said, this will result in duplicates when rows are updated.
  2. Periodically overwrite a second table with "de-duplicated" data from the first table.
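The two steps above can be sketched with a window-function de-duplication. Here is a minimal illustration using Python's built-in sqlite3 (table and column names are made up for the example); the equivalent Hive statement would use the same `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY updated_ts DESC)` pattern inside an `INSERT OVERWRITE TABLE final SELECT ...`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Step 1 target: an append-only staging table that accumulates every
# imported row, so an updated row appears twice (old + new version).
cur.execute("CREATE TABLE staging (id INTEGER, amount INTEGER, updated_ts TEXT)")
cur.executemany(
    "INSERT INTO staging VALUES (?, ?, ?)",
    [
        (1, 100, "2014-01-05"),  # original row
        (2, 200, "2014-01-05"),
        (1, 150, "2014-02-01"),  # later update of id=1 -> duplicate key
    ],
)

# Step 2: periodically rebuild the serving table with only the latest
# version of each key (in Hive: INSERT OVERWRITE TABLE final SELECT ...).
cur.execute("CREATE TABLE final (id INTEGER, amount INTEGER, updated_ts TEXT)")
cur.execute("""
    INSERT INTO final
    SELECT id, amount, updated_ts FROM (
        SELECT id, amount, updated_ts,
               ROW_NUMBER() OVER (PARTITION BY id
                                  ORDER BY updated_ts DESC) AS rn
        FROM staging
    ) AS t
    WHERE rn = 1
""")

print(sorted(cur.execute("SELECT id, amount FROM final").fetchall()))
# -> [(1, 150), (2, 200)]
```

The key detail is ordering by the updated timestamp descending and keeping only `rn = 1`, so the newest version of each key survives the overwrite.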

The second step is potentially painful, but there's really no way around it. At some level, you have to be overwriting, since Hive doesn't do in-place updating. Depending on your data, you may be able to partition the tables cleverly enough to avoid doing full overwrites, though. For example, if step 1 only inserts into a handful of partitions, then only those partitions need to be overwritten into the second table.
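That partition-pruned rebuild can be sketched in a few lines, modelling the year/month-partitioned table as a plain dict (all names here are illustrative, not a Hive API); in real Hive this corresponds to `INSERT OVERWRITE TABLE final PARTITION (year=..., month=...)` for just the touched partitions:

```python
# Model a table partitioned by (year, month) as {partition: {key: row}}.
# Only partitions that appear in the day's delta get rewritten; all
# other historical partitions are left untouched.

def apply_delta(table, delta_rows):
    """delta_rows: iterable of (key, partition, payload, updated_ts)."""
    touched = {part for _, part, _, _ in delta_rows}
    for part in touched:
        # Rebuild just this partition: existing rows plus delta rows,
        # keeping the latest version of each key.
        merged = dict(table.get(part, {}))
        for key, p, payload, ts in delta_rows:
            if p != part:
                continue
            old = merged.get(key)
            if old is None or ts >= old[1]:
                merged[key] = (payload, ts)
        table[part] = merged
    return touched

table = {
    ("2013", "11"): {1: ("old", "2013-11-02")},
    ("2013", "12"): {2: ("keep", "2013-12-09")},
}
touched = apply_delta(table, [(1, ("2013", "11"), "new", "2014-01-05")])
print(touched)                      # only the 2013-11 partition was rewritten
print(table[("2013", "11")][1][0])  # -> 'new'
```

Note that the 2013-12 partition is never read or rewritten; that is exactly the saving over a full overwrite.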

Also, depending on the access pattern, it can make sense to just have the second "de-duplicated" table be a view and not materialize it at all. Usually this just delays the pain to query time, though.
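The view alternative simply moves the de-duplication from the write path to the read path; sketched in the same toy model (names are illustrative), every read scans the staging data and keeps the newest version of each key, with nothing materialized:

```python
def latest_rows(staging_rows):
    """Read-time de-duplication: like querying a Hive view that keeps
    only the newest version of each key; nothing is materialized."""
    best = {}
    for key, payload, ts in staging_rows:
        if key not in best or ts >= best[key][1]:
            best[key] = (payload, ts)
    for key, (payload, ts) in sorted(best.items()):
        yield key, payload, ts

staging = [
    (1, "v1", "2014-01-05"),
    (2, "a", "2014-01-05"),
    (1, "v2", "2014-02-01"),  # update of key 1
]
print(list(latest_rows(staging)))
# -> [(1, 'v2', '2014-02-01'), (2, 'a', '2014-01-05')]
```

The trade-off is visible even in the sketch: the staging data is scanned on every read, which is the "pain delayed to query time" mentioned above.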

The only other way round this I've seen is using a very custom input and output format. Rather than explain it all, you can read about it here: http://pkghosh.wordpress.com/2012/07/08/making-hive-squawk-like-a-real-database/

Owen O'Malley has also been working on adding a version of this idea to standard Hive, but that's still in development: https://issues.apache.org/jira/browse/HIVE-5317

