Hadoop partitioner


Question

I want to ask about the Hadoop partitioner: is it implemented within the Mappers? How can I measure the performance of the default hash partitioner, and is there a better partitioner for reducing data skew?

Thanks

Answer

The partitioner is a key component between the Mappers and Reducers. It distributes the map-emitted data among the Reducers.

The partitioner runs within every Map Task JVM (Java process).

The default partitioner, HashPartitioner, works based on a hash function and is very fast compared to other partitioners such as TotalOrderPartitioner. It runs the hash function on every map output key, i.e.:

Reduce_Number = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
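The same arithmetic can be reproduced in plain Java to see which reducer a given key would be routed to. This is a dependency-free sketch of the formula above; the key strings and reducer count are made up for illustration:

```java
public class HashPartitionDemo {
    // Mirrors HashPartitioner's logic: mask off the sign bit so the
    // result is non-negative, then take the modulo of the reducer count.
    public static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4; // assumed reducer count, for illustration only
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, reducers));
        }
    }
}
```

Because the partition depends only on the key's hash code, every record with the same key is guaranteed to land on the same reducer.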

To check the performance of the Hash Partitioner, use the Reduce task counters and see how the records were distributed among the reducers.

The Hash Partitioner is a basic partitioner and is not well suited to processing data with high skewness.

To address data skew problems, we need to write a custom partitioner class extending the Partitioner.java class from the MapReduce API.

An example of such a custom partitioner is a RandomPartitioner. It is one of the best ways to distribute skewed data evenly among the reducers.
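The random-assignment idea can be sketched as follows. In a real job this logic would live in a class extending `org.apache.hadoop.mapreduce.Partitioner` and overriding `getPartition`; this dependency-free sketch (class and method names are hypothetical) shows only the core idea:

```java
import java.util.Random;

public class RandomStylePartitioner {
    private final Random random = new Random();

    // Assign each record to a random reducer, ignoring the key entirely,
    // so no single reducer is overloaded by a hot key.
    // Trade-off: records sharing a key no longer meet at one reducer,
    // so this only works when the reduce logic does not require
    // all values for a key to be grouped together (or a second
    // aggregation pass is added).
    public int getPartition(String key, int numReduceTasks) {
        return random.nextInt(numReduceTasks);
    }
}
```

A common variant is to salt only the known hot keys and leave the rest hash-partitioned, keeping grouping semantics for normal keys.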
