Hadoop MapReduce secondary sorting

Question

Can anyone explain how secondary sorting works in Hadoop?
Why must one use a GroupingComparator, and how does it work in Hadoop?

I was going through the link given below and am unsure how the grouping comparator works.
Can anyone explain how the grouping comparator works?

http://www.bigdataspeak.com/2013/02/hadoop-how-to-do-secondary-sort-on_25.html

Recommended answer

Grouping comparator

Once the data reaches a reducer, all data is grouped by key. Since we have a composite key, we need to make sure records are grouped solely by the natural key. This is accomplished by writing a custom grouping comparator: a Comparator that considers only the yearMonth field of the TemperaturePair class when grouping records together.

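import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups composite keys by their natural key (yearMonth) only.
public class YearMonthGroupingComparator extends WritableComparator {

    public YearMonthGroupingComparator() {
        // true: let WritableComparator create TemperaturePair instances to deserialize keys into
        super(TemperaturePair.class, true);
    }

    @Override
    public int compare(WritableComparable tp1, WritableComparable tp2) {
        TemperaturePair temperaturePair = (TemperaturePair) tp1;
        TemperaturePair temperaturePair2 = (TemperaturePair) tp2;
        // The temperature part of the key is deliberately ignored here.
        return temperaturePair.getYearMonth().compareTo(temperaturePair2.getYearMonth());
    }
}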
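
To see why grouping by the natural key matters, the following is a minimal plain-Java sketch of what happens on the reduce side. It does not use any Hadoop classes: `Pair`, `SORT`, `GROUP` and `reduceCalls` are hypothetical names introduced only for illustration, and the sample temperatures are made up. The idea it models is that keys arrive sorted by the full composite key, and a new reduce call starts only when the grouping comparator reports that two adjacent keys differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is not how Hadoop is implemented internally.
class SecondarySortSketch {

    // Hypothetical stand-in for the composite key: natural key plus the value to sort by.
    static final class Pair {
        final String yearMonth;
        final int temperature;

        Pair(String yearMonth, int temperature) {
            this.yearMonth = yearMonth;
            this.temperature = temperature;
        }
    }

    // Sort order: natural key first, then temperature (the "secondary" sort).
    static final Comparator<Pair> SORT =
            Comparator.comparing((Pair p) -> p.yearMonth).thenComparingInt(p -> p.temperature);

    // Grouping order: natural key only, so temperature differences do not split a group.
    static final Comparator<Pair> GROUP = Comparator.comparing((Pair p) -> p.yearMonth);

    // Sorts the records, then opens a new group whenever GROUP says adjacent keys differ.
    static Map<String, List<Integer>> reduceCalls(List<Pair> mapOutput) {
        List<Pair> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        Map<String, List<Integer>> calls = new LinkedHashMap<>();
        Pair previous = null;
        List<Integer> current = null;
        for (Pair p : sorted) {
            if (previous == null || GROUP.compare(previous, p) != 0) {
                current = new ArrayList<>();
                calls.put(p.yearMonth, current);
            }
            current.add(p.temperature);
            previous = p;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Pair> mapOutput = Arrays.asList(
                new Pair("190102", -100), new Pair("190101", 22),
                new Pair("190101", -206), new Pair("190102", -333));
        // prints {190101=[-206, 22], 190102=[-333, -100]}
        System.out.println(reduceCalls(mapOutput));
    }
}
```

Each map entry corresponds to one reduce call, and its values are already in ascending temperature order. If `GROUP` were replaced by `SORT` in `reduceCalls`, every distinct (yearMonth, temperature) pair would start its own group, which is the behavior the grouping comparator is meant to prevent.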

Here are the results of running our secondary sort job:

new-host-2:sbin bbejeck$ hdfs dfs -cat secondary-sort/part-r-00000
190101 -206
190101 -206
190102 -333
190102 -333
190103 -272
190103 -272
190104 -61
190105 -33
190106 44
190107 72
190108 44
190109 17
190110 -33
190111 -217
190111 -217
190112 -300
190112 -300

While sorting data by value may not be a common need, it is a nice tool to have in your back pocket when needed. Working with custom partitioners and grouping comparators also gives a deeper look at the inner workings of Hadoop. See also: What is the use of grouping comparator in hadoop map reduce
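
The custom partitioner mentioned above is not shown in this answer. As a rough sketch of the idea, assuming the natural key is the yearMonth string, the partition can be derived from the natural key alone so that all records for the same yearMonth reach the same reducer regardless of temperature. The class and method names below are hypothetical and the code is plain Java rather than a real Hadoop `Partitioner` subclass.

```java
// Plain-Java illustration of partitioning on the natural key only.
class NaturalKeyPartitionSketch {

    // Hypothetical helper: computes a partition index from yearMonth and ignores temperature.
    static int partitionFor(String yearMonth, int temperature, int numReducers) {
        // Mask the sign bit so a negative hash code still yields a valid index.
        return (yearMonth.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Same natural key, different temperatures: both land in the same partition.
        System.out.println(partitionFor("190101", -206, 4) == partitionFor("190101", 22, 4));
    }
}
```

In an actual job, the partitioner and grouping comparator classes would be registered on the `Job` with `setPartitionerClass` and `setGroupingComparatorClass`; the exact driver used by the linked post is not reproduced here.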
