How does OutputCollector work?


Problem description

I was trying to analyse the default MapReduce job, the one that doesn't define a mapper or a reducer, i.e. one that uses IdentityMapper and IdentityReducer. To make myself clear, I just wrote my own identity reducer:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public static class MyIdentityReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Pass every (key, value) pair through unchanged.
        while (values.hasNext()) {
            Text value = values.next();
            output.collect(key, value);
        }
    }
}
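
For reference, a minimal driver for such a job might look like the sketch below. It assumes the old org.apache.hadoop.mapred API used by the reducer above; the class name IdentityJob, the paths, and the separator override are illustrative assumptions based on the sample input that follows, not something taken from the original job.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class IdentityJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(IdentityJob.class);
        conf.setJobName("identity");

        // Split each input line into a Text key and Text value.
        // KeyValueTextInputFormat splits on a tab by default; the sample
        // file here is space-separated, hence the override (an assumption).
        conf.setInputFormat(KeyValueTextInputFormat.class);
        conf.set("key.value.separator.in.input.line", " ");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // No setMapperClass()/setReducerClass() calls: the old mapred API
        // then falls back to IdentityMapper and IdentityReducer.
        // conf.setReducerClass(MyIdentityReducer.class); // to plug in the one above

        FileInputFormat.setInputPaths(conf, new Path("NameAddress.txt"));
        FileOutputFormat.setOutputPath(conf, new Path("NameAddress"));

        JobClient.runJob(conf);
    }
}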

My input file was:

$ hadoop fs -cat NameAddress.txt
Dravid Banglore
Sachin Mumbai
Dhoni Ranchi
Dravid Jaipur
Dhoni Chennai
Sehwag Delhi
Gambhir Delhi
Gambhir Calcutta

I was expecting:
Dravid Jaipur
Dhoni Chennai
Gambhir Calcutta
Sachin Mumbai
Sehwag Delhi

I got:
$ hadoop fs -cat NameAddress/part-00000
Dhoni   Ranchi
Dhoni   Chennai
Dravid  Banglore
Dravid  Jaipur
Gambhir Delhi
Gambhir Calcutta
Sachin  Mumbai
Sehwag  Delhi

I was under the impression that, since any aggregation is done by the programmer in the reducer's while loop before writing to the OutputCollector, the keys passed to the OutputCollector are always unique; so if I don't aggregate, the last value for a key should overwrite the previous ones. Clearly that's not the case. Could someone please give me better insight into the OutputCollector: how it works and how it handles all the keys? I see many implementations of OutputCollector in the Hadoop source code. Can I write my own OutputCollector that does what I am expecting?

Solution

The keys arriving at the reducer are unique: each call to reduce() receives one distinct key and an iterator over all of the values associated with that key. For your input, for example, the call for the key Dravid receives both Banglore and Jaipur. What you're doing is iterating over all of the values passed in and writing out each one.

So it doesn't matter that there are fewer reduce() calls than input records; you still end up writing all of the values out.
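
If you want exactly one output line per key, the aggregation has to happen inside reduce() itself; OutputCollector.collect() simply writes out every pair it is given and does no de-duplication. Below is a minimal sketch of such a reducer, assuming the same old mapred API as above. Note that the order of values within a key is not guaranteed without a secondary sort, so "last" here means last in iteration order, not necessarily last in the input file.

public static class LastValueReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        Text last = null;
        while (values.hasNext()) {
            // Hadoop reuses the Text instance behind the iterator,
            // so take a copy rather than keeping the reference.
            last = new Text(values.next());
        }
        if (last != null) {
            // Emit exactly one (key, value) pair per reduce() call.
            output.collect(key, last);
        }
    }
}

With a reducer along these lines, each of the five keys in your sample would produce a single address, giving output shaped like what you expected (modulo which address counts as the "last" one).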
