Delta/Incremental Load in Hive


Question

I have the following use case:

My application has a table in an RDBMS database holding several years of data. We have used Sqoop to bring the data into HDFS and have loaded it into a Hive table partitioned by year and month.

Now, the application also updates and inserts new records into the RDBMS table daily. These updated records can span historical months. Updated and newly inserted records can be identified by an updated-timestamp field (it will carry the current day's timestamp).

The problem is: how do we do a daily delta/incremental load of the Hive table using these updated records?

-> I know there is Sqoop functionality that allows incremental imports. But an incremental import of the new records alone is not enough for us, because:

-> I cannot simply insert these records into the Hive table (using INSERT INTO), because that would produce duplicate rows for the updated records.

-> In the same way, I cannot use an INSERT OVERWRITE statement, since these are just the updated and newly inserted records spanning multiple months; an insert overwrite would wipe out the earlier records.

Of course, the easiest option would be to pull the full data with Sqoop daily, but we don't want to do that because the data volume is large.

So, basically, we want to fully reload only those partitions for which we have received updated/inserted records.

We are open to exploring options at the Hive or Sqoop end. Can you please let us know?

Thanks in advance.

Answer

Updates are a notoriously difficult problem for any Hive-based system.

One typical approach is a two-step process:

  1. Insert any data that has changed into a first table (see the Sqoop sketch below). As you note, this will produce duplicates wherever rows were updated.
  2. Periodically overwrite a second table with the de-duplicated data from the first.
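
For step 1, a daily Sqoop pull in `lastmodified` mode grabs only the rows whose update timestamp moved past the previous run's watermark. A minimal sketch, assuming hypothetical connection details, table, and column names (`source_table`, `updated_at`); the `--last-value` would come from wherever you track the previous run:

```bash
# Hedged sketch: pull only rows changed since the last run.
# Connection string, table, and column names are assumptions.
sqoop import \
  --connect jdbc:mysql://dbhost/appdb \
  --username app_user -P \
  --table source_table \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2014-01-01 00:00:00" \
  --target-dir /staging/source_table_delta \
  --append
```

The imported files can then be loaded into a Hive staging table (called `staging_delta` in the sketches below) and appended to the first, duplicate-bearing table with a plain INSERT INTO.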

The second step is potentially painful, but there's really no way around it. At some level you have to be overwriting, since Hive doesn't do in-place updates. Depending on your data, though, you may be able to partition the tables cleverly enough to avoid full overwrites. For example, if step 1 only inserts into a handful of partitions, then only those partitions need to be overwritten into the second table.
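
Here is a sketch of step 2 under assumed, hypothetical names: `raw_table` is the step-1 table that accumulates duplicates, `clean_table` is the de-duplicated copy (both partitioned by year and month), and `staging_delta` holds today's imported rows. A dynamic-partition INSERT OVERWRITE rewrites only the partitions that appear in the query output, so untouched months are left alone:

```sql
-- All table and column names are assumptions for illustration.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE clean_table PARTITION (year, month)
SELECT id, col1, updated_at, year, month
FROM (
  SELECT r.id, r.col1, r.updated_at, r.year, r.month,
         ROW_NUMBER() OVER (PARTITION BY r.id
                            ORDER BY r.updated_at DESC) AS rn
  FROM raw_table r
  -- restrict the rebuild to partitions touched by the latest delta
  LEFT SEMI JOIN (SELECT DISTINCT year, month FROM staging_delta) d
    ON r.year = d.year AND r.month = d.month
) ranked
WHERE rn = 1;  -- keep only the most recent version of each id
```

Note that the windowing functions used here (ROW_NUMBER) require Hive 0.11 or later.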

Also, depending on the access pattern, it can make sense to just have the second, de-duplicated table be a view and not materialize it at all. Usually, though, this just delays the pain to query time.
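
The view variant, under the same hypothetical names, would simply wrap the de-duplication query, trading the periodic rebuild cost for a per-query cost:

```sql
-- Sketch: de-duplicate at query time instead of materializing step 2.
CREATE VIEW clean_view AS
SELECT id, col1, updated_at, year, month
FROM (
  SELECT id, col1, updated_at, year, month,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM raw_table
) ranked
WHERE rn = 1;
```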

The only other way around this I've seen is using a very custom input and output format. Rather than explain it all, you can read about it here: http://pkghosh.wordpress.com/2012/07/08/making-hive-squawk-like-a-real-database/

Owen O'Malley has also been working on adding a version of this idea to standard Hive, but that's still in development: https://issues.apache.org/jira/browse/HIVE-5317
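
For later readers: HIVE-5317 eventually shipped as ACID transactional tables (Hive 0.14 and later). A hedged sketch of the in-place update it enables, using hypothetical names; note that ACID tables must be stored as ORC and marked transactional:

```sql
-- Sketch only: requires Hive 0.14+ with the transaction manager enabled.
CREATE TABLE hist_acid (id INT, col1 STRING, updated_at TIMESTAMP)
PARTITIONED BY (year INT, month INT)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- In-place update; no manual overwrite of the partition is needed.
UPDATE hist_acid SET col1 = 'new value' WHERE id = 42;
```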

