Alternative to the default HashPartitioner provided with Hadoop


Problem Description


I have a Hadoop MapReduce program that distributes keys unevenly: some reducers end up with two keys, some with one key, and some with none. How do I force Hadoop to send each distinct key to its own reducer? I have nine unique keys of the form:

0,0
0,1
0,2
1,0
1,1
1,2
2,0
2,1
2,2

and I set job.setNumReduceTasks(9), but the HashPartitioner seems to hash two keys to the same hash code, causing overlapping keys to be sent to the same reducer and leaving some reducers idle.

Would a random partitioner resolve this? Would it send each unique key to a random reducer, guaranteeing that each reducer receives a single key? How do I enable it and replace the default?
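
One way to confirm the collisions is to run the default partitioner by hand. Below is a minimal sketch, assuming the keys are Text values of the form shown above (the class name PartitionCheck is a placeholder); it prints the reducer index that the default HashPartitioner assigns to each of the nine keys:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionCheck {
    public static void main(String[] args) {
        // The same partitioner Hadoop uses by default.
        HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<>();
        String[] keys = {"0,0", "0,1", "0,2", "1,0", "1,1", "1,2", "2,0", "2,1", "2,2"};
        for (String k : keys) {
            // HashPartitioner ignores the value, so passing null is safe here.
            int partition = partitioner.getPartition(new Text(k), null, 9);
            System.out.println(k + " -> reducer " + partition);
        }
    }
}

Any two keys that print the same reducer index account for the doubled-up and empty output files.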

EDIT:

Can someone please explain why my output looks like this:

-rw-r--r--   1 user supergroup          0 2018-04-19 18:58 outbin9/_SUCCESS
drwxr-xr-x   - user supergroup          0 2018-04-19 18:57 outbin9/_logs
-rw-r--r--   1 user supergroup        869 2018-04-19 18:57 outbin9/part-r-00000
-rw-r--r--   1 user supergroup       1562 2018-04-19 18:57 outbin9/part-r-00001
-rw-r--r--   1 user supergroup        913 2018-04-19 18:58 outbin9/part-r-00002
-rw-r--r--   1 user supergroup       1771 2018-04-19 18:58 outbin9/part-r-00003
-rw-r--r--   1 user supergroup        979 2018-04-19 18:58 outbin9/part-r-00004
-rw-r--r--   1 user supergroup        880 2018-04-19 18:58 outbin9/part-r-00005
-rw-r--r--   1 user supergroup          0 2018-04-19 18:58 outbin9/part-r-00006
-rw-r--r--   1 user supergroup          0 2018-04-19 18:58 outbin9/part-r-00007
-rw-r--r--   1 user supergroup        726 2018-04-19 18:58 outbin9/part-r-00008

The larger files, part-r-00001 and part-r-00003, received two keys each: 1,0 and 2,2, and 0,0 and 1,2, respectively. And notice that part-r-00006 and part-r-00007 are empty.

Solution

HashPartitioner is the default partitioner in Hadoop. It assigns each record to a reduce task by hashing the key, so two distinct keys can land on the same reducer whenever their hash codes are congruent modulo the number of reducers. All the values with the same key go to the same reducer instance, in a single call to the reduce function.
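
For reference, Hadoop's default implementation (org.apache.hadoop.mapreduce.lib.partition.HashPartitioner) is essentially the following:

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  // Mask the sign bit so the result is non-negative, then take the
  // remainder modulo the number of reduce tasks.
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}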

If you want a particular group of keys to go to particular reducers, you can write your own Partitioner implementation. It can be general purpose, or tailored to the specific data types and values your application uses.

A custom partitioner lets you decide which reducer each record goes to, based on your own condition. By partitioning on the key, you guarantee that all records for the same key go to the same reducer: the partitioner ensures that exactly one reducer receives all the records for a particular key.
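
Here is a minimal sketch for the nine keys above, assuming they are Text values of the form "i,j" with i and j in {0, 1, 2}; the class name GridPartitioner and the IntWritable value type are placeholders for whatever your job actually emits:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class GridPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Parse a key of the form "i,j" into its two components.
        String[] parts = key.toString().split(",");
        int i = Integer.parseInt(parts[0].trim());
        int j = Integer.parseInt(parts[1].trim());
        // i * 3 + j maps the nine (i, j) pairs to the distinct values 0..8;
        // the modulo guards against running with fewer than nine reducers.
        return (i * 3 + j) % numReduceTasks;
    }
}

Register it on the job with job.setPartitionerClass(GridPartitioner.class) alongside job.setNumReduceTasks(9). With nine reducers, every key then gets its own partition, so no reducer sits idle and none receives two keys.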

sample example link
