Alternative to the default HashPartitioner provided with Hadoop
Question
I have a Hadoop MapReduce program whose keys are distributed unevenly across the reducers: some reducers end up with two keys, some with one, and some with none. How do I force Hadoop to send each key's partition to a separate reducer? I have nine unique keys of the form:
0,0
0,1
0,2
1,0
1,1
1,2
2,0
2,1
2,2
and I have set job.setNumReduceTasks(9); but the HashPartitioner appears to hash two of the keys to the same hash code, so those keys are sent to the same reducer while other reducers sit idle.
Would a random partitioner resolve this? Would it send each unique key to a random reducer, guaranteeing that each reducer receives exactly one key? How do I enable it and replace the default?
EDIT:
Can someone please explain why my output looks like this:
-rw-r--r-- 1 user supergroup 0 2018-04-19 18:58 outbin9/_SUCCESS
drwxr-xr-x - user supergroup 0 2018-04-19 18:57 outbin9/_logs
-rw-r--r-- 1 user supergroup 869 2018-04-19 18:57 outbin9/part-r-00000
-rw-r--r-- 1 user supergroup 1562 2018-04-19 18:57 outbin9/part-r-00001
-rw-r--r-- 1 user supergroup 913 2018-04-19 18:58 outbin9/part-r-00002
-rw-r--r-- 1 user supergroup 1771 2018-04-19 18:58 outbin9/part-r-00003
-rw-r--r-- 1 user supergroup 979 2018-04-19 18:58 outbin9/part-r-00004
-rw-r--r-- 1 user supergroup 880 2018-04-19 18:58 outbin9/part-r-00005
-rw-r--r-- 1 user supergroup 0 2018-04-19 18:58 outbin9/part-r-00006
-rw-r--r-- 1 user supergroup 0 2018-04-19 18:58 outbin9/part-r-00007
-rw-r--r-- 1 user supergroup 726 2018-04-19 18:58 outbin9/part-r-00008
The larger files, part-r-00001 and part-r-00003, received the key pairs 1,0 / 2,2 and 0,0 / 1,2 respectively. And notice that part-r-00006 and part-r-00007 are empty.
Answer

HashPartitioner
is the default partitioner in Hadoop. It assigns each record to a reduce task by computing (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, so all values with the same key go to the same reducer instance, in a single call to the reduce function.
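To see why collisions are expected, here is a small standalone sketch of that formula applied to the nine keys. It models the keys as plain Java Strings for illustration; the asker's real key type (e.g. Text or a custom Writable) hashes differently, so the exact collisions land on different part files, but the pigeonhole effect is identical.

```java
public class HashDemo {
    public static void main(String[] args) {
        int numReduceTasks = 9;
        String[] keys = {"0,0", "0,1", "0,2", "1,0", "1,1", "1,2", "2,0", "2,1", "2,2"};
        for (String key : keys) {
            // The same formula HashPartitioner.getPartition uses.
            int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            System.out.println(key + " -> partition " + partition);
        }
    }
}
```

With String keys, "0,0" and "1,2" collide on partition 2, and "1,0" and "2,2" collide on partition 0, leaving partitions 5 and 6 empty — the same shape of imbalance the asker observed, just shifted because of the different key type.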
If you want a particular group of results stored in specific reducers, you can write your own partitioner implementation. It can be general purpose, or tailored to the specific key types or values your application uses.
Custom Partitioner
allows you to route records to reducers based on your own condition. By partitioning on the key, you guarantee that all records for the same key go to the same reducer, and that only one reducer receives all the records for that particular key.
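For the nine keys above, a deterministic mapping guarantees one key per reducer. Here is a minimal sketch, assuming Text keys of the form "i,j" with i,j in 0..2 and IntWritable map-output values (adjust the generic types to your job's actual types):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner for keys "i,j" with i,j in 0..2:
// sends key (i,j) to partition 3*i + j, i.e. one reducer per key.
public class GridPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String[] parts = key.toString().split(",");
        int i = Integer.parseInt(parts[0].trim());
        int j = Integer.parseInt(parts[1].trim());
        // Modulo keeps the result valid even if fewer reducers are configured.
        return (3 * i + j) % numReduceTasks;
    }
}
```

Register it with job.setPartitionerClass(GridPartitioner.class); alongside job.setNumReduceTasks(9);. Note that a random partitioner would not give this guarantee: random assignment of nine keys to nine reducers would almost certainly leave some reducers with multiple keys and others empty, whereas an explicit mapping like this cannot collide.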