Unique Key generation in Hive/Hadoop


Problem Description

While selecting a set of records from a big-data Hive table, a unique key needs to be created for each record. In a sequential mode of operation, it is easy to generate a unique id by calling something like max(id). Since Hive runs tasks in parallel, how can we generate a unique key as part of a select query without compromising Hadoop's performance? Is this really a map-reduce problem, or do we need to fall back to a sequential approach to solve it?

Recommended Answer

If for some reason you do not want to deal with UUIDs, then this solution (based on numeric values) does not require your parallel units to "talk" to each other or synchronize at all. It is therefore very efficient, but it does not guarantee that your integer keys will be contiguous.

If you have, say, N parallel units of execution, you know your N, and each unit is assigned an ID from 0 to N - 1, then you can simply generate an integer that is unique across all units:

Unit #0:   0, N, 2N, 3N, ...
Unit #1:   1, N+1, 2N+1, 3N+1, ...
...
Unit #N-1: N-1, N+(N-1), 2N+(N-1), 3N+(N-1), ...
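
To make the interleaving concrete, here is a minimal, Hadoop-free Java sketch of the scheme (N = 3 and the class name StrideDemo are purely illustrative): unit k only ever emits counter * N + k, so no two units can produce the same value.

public class StrideDemo {
    public static void main(String[] args) {
        int n = 3; // N parallel units (illustrative value)
        for (int unitId = 0; unitId < n; unitId++) {
            StringBuilder keys = new StringBuilder();
            for (long counter = 0; counter < 4; counter++) {
                keys.append(counter * n + unitId).append(' ');
            }
            // e.g. prints "Unit #1: 1 4 7 10"
            System.out.println("Unit #" + unitId + ": " + keys.toString().trim());
        }
    }
}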

Depending on where you need to generate the keys (mapper or reducer), you can get your N from the Hadoop configuration:

Mapper:  mapred.map.tasks
Reducer: mapred.reduce.tasks

... and the ID of your unit. In Java, it is:

 context.getTaskAttemptID().getTaskID().getId()
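
Putting the pieces together, here is a hedged sketch of how a mapper could hand out unique keys; the class name UniqueKeyMapper is hypothetical, and it assumes the older mapred.map.tasks property quoted above is populated for the job.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that tags every input record with a unique long key
// using the stride scheme above: key = counter * N + unitId.
public class UniqueKeyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private int n;        // total number of map tasks (N)
    private int unitId;   // this task's ID in [0, N)
    private long counter; // per-task record counter

    @Override
    protected void setup(Context context) {
        // N from the job configuration, using the property name referenced above.
        n = context.getConfiguration().getInt("mapred.map.tasks", 1);
        // Unit ID from the task attempt ID, exactly as quoted above.
        unitId = context.getTaskAttemptID().getTaskID().getId();
        counter = 0;
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        long uniqueKey = counter * n + unitId; // never collides across tasks
        counter++;
        context.write(new LongWritable(uniqueKey), value);
    }
}

If the keys need to be generated on the reduce side instead, the same pattern applies with mapred.reduce.tasks and the reducer's task ID.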

I am not sure about Hive, but it should be possible there as well.

