How to handle id generation on a hadoop cluster?

Problem description

I am building a dictionary on a hadoop cluster and need to generate a numeric id for each token. How should I do it?

Solution

You have two problems. First, you want to make sure that you assign exactly one id to each token. To do that, sort and group the records by token and make the assignment in a reducer. Once you have made sure that the reduce method is called exactly once for each token, you can build the id from the partition number in the context and a unique numeric id maintained by the reducer (one instance per partition): use an instance variable that is initialized to 1 in the setup method and incremented in the reduce method.
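
The answer does not say how to combine the partition number with the per-reducer counter, so the following is only a minimal sketch of one possible scheme, written in plain Java without Hadoop dependencies. The class name TokenIdAssigner, the method nextId, and the formula counter * numPartitions + partition are assumptions made for illustration. In a real job the constructor logic would live in the reducer's setup method and nextId would be called once per reduce call.

```java
/**
 * Hypothetical helper that models one reducer instance (one per partition).
 * It is not part of Hadoop; it only illustrates the id scheme.
 */
final class TokenIdAssigner {
    private final int partition;      // partition number of this reducer
    private final int numPartitions;  // total number of reducers in the job
    private long counter;             // unique numeric id maintained per reducer

    TokenIdAssigner(int partition, int numPartitions) {
        if (numPartitions <= 0 || partition < 0 || partition >= numPartitions) {
            throw new IllegalArgumentException("partition must be in [0, numPartitions)");
        }
        this.partition = partition;
        this.numPartitions = numPartitions;
        this.counter = 1; // initialized to 1, as the setup method would do
    }

    /**
     * Returns the id for the next distinct token, as the reduce method would do.
     * Assumed scheme: ids from different partitions never collide because
     * id % numPartitions == partition and id / numPartitions == counter.
     */
    long nextId() {
        long id = counter * numPartitions + partition;
        counter++; // incremented once per token
        return id;
    }

    public static void main(String[] args) {
        // Simulate a job with 3 reducers, each seeing 2 distinct tokens.
        int numPartitions = 3;
        for (int p = 0; p < numPartitions; p++) {
            TokenIdAssigner assigner = new TokenIdAssigner(p, numPartitions);
            for (int i = 0; i < 2; i++) {
                System.out.println("partition " + p + " -> id " + assigner.nextId());
            }
        }
    }
}
```

With this scheme the ids are unique across the cluster but not contiguous, since each partition may see a different number of tokens. Whether that matters depends on how the dictionary is used.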