How does OutputCollector work?
Question
I was trying to analyse the default MapReduce job, one that doesn't define a mapper or a reducer, i.e. one that uses IdentityMapper & IdentityReducer. To make this concrete, I wrote my own identity reducer:
public static class MyIdentityReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            Text value = values.next();
            output.collect(key, value);
        }
    }
}
My input file was:
$ hadoop fs -cat NameAddress.txt
Dravid Banglore
Sachin Mumbai
Dhoni Ranchi
Dravid Jaipur
Dhoni Chennai
Sehwag Delhi
Gambhir Delhi
Gambhir Calcutta
I was expecting:
Dravid Jaipur
Dhoni Chennai
Gambhir Calcutta
Sachin Mumbai
Sehwag Delhi
I got:
$ hadoop fs -cat NameAddress/part-00000
Dhoni Ranchi
Dhoni Chennai
Dravid Banglore
Dravid Jaipur
Gambhir Delhi
Gambhir Calcutta
Sachin Mumbai
Sehwag Delhi
I was of the opinion that aggregation is done by the programmer in the reducer's while loop, and the result is then written to the OutputCollector. I was under the impression that the keys a reducer passes to the OutputCollector are always unique, and that since I don't aggregate here, the last value for a key would overwrite the previous ones. Clearly that is not the case.
Could someone give me better insight into the OutputCollector: how it works and how it handles all the keys? I see many implementations of OutputCollector in the Hadoop source code.
Can I write my own OutputCollector that does what I am expecting?
Solution
The keys are unique from the reducer's point of view: each call to reduce receives one unique key and an iterator over all the values associated with that key. What you are doing is iterating over all of the values passed in and writing out each one.
So it doesn't matter that there are fewer reduce calls than input records; you still end up writing out every value.
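To see why every record survives, here is a minimal sketch in plain Java, with no Hadoop dependency. The class name ShuffleSketch and the TreeMap-based group method are my own stand-ins for the framework's shuffle-and-sort phase; they are not Hadoop APIs. The identity-style reduce loop then writes out every value for every key, reproducing the grouped, key-sorted output shown in part-00000 above:

```java
import java.util.*;

public class ShuffleSketch {
    // Stand-in for Hadoop's shuffle-and-sort phase (an assumption for
    // illustration): all values are grouped under their key, and keys
    // arrive at the reducer in sorted order.
    static SortedMap<String, List<String>> group(String[][] records) {
        SortedMap<String, List<String>> grouped = new TreeMap<>();
        for (String[] r : records) {
            grouped.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"Dravid", "Banglore"}, {"Sachin", "Mumbai"}, {"Dhoni", "Ranchi"},
            {"Dravid", "Jaipur"}, {"Dhoni", "Chennai"}, {"Sehwag", "Delhi"},
            {"Gambhir", "Delhi"}, {"Gambhir", "Calcutta"}
        };

        // Identity reduce: one call per unique key; since nothing is
        // aggregated, every value under that key gets written out.
        for (Map.Entry<String, List<String>> e : group(records).entrySet()) {
            for (String value : e.getValue()) {
                System.out.println(e.getKey() + "\t" + value);
            }
        }
    }
}
```

Nothing is overwritten because output.collect is append-only: a second value for Dhoni does not replace the first, it is simply emitted as another record, which is exactly the eight-line output the question reports.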