Two equal combine keys do not get to the same reducer


Problem description




I'm making a Hadoop application in Java with the MapReduce framework.

I use only Text keys and values for both input and output. I use a combiner to do an extra step of computations before reducing to the final output.

But I have the problem that the keys do not go to the same reducer. I create and add the key/value pair like this in the combiner:

public static class Step4Combiner extends Reducer<Text,Text,Text,Text> {
    private static Text key0 = new Text();
    private static Text key1 = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        key0.set("KeyOne");
        key1.set("KeyTwo");
        context.write(key0, new Text("some value"));
        context.write(key1, new Text("some other value"));
    }

}

public static class Step4Reducer extends Reducer<Text,Text,Text,Text> {

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        System.out.print("Key:" + key.toString() + " Value: ");
        String theOutput = "";
        for (Text val : values) {
            System.out.print("," + val);
        }
        System.out.print("\n");

        context.write(key, new Text(theOutput));
    }

}

In main, I create the job like this:

Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

Job job4 = new Job(conf, "Step 4");
job4.setJarByClass(Step4.class);

job4.setMapperClass(Step4.Step4Mapper.class);
job4.setCombinerClass(Step4.Step4Combiner.class);
job4.setReducerClass(Step4.Step4Reducer.class);

job4.setInputFormatClass(TextInputFormat.class);
job4.setOutputKeyClass(Text.class);
job4.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job4, new Path(outputPath));
FileOutputFormat.setOutputPath(job4, new Path(finalOutputPath));            

System.exit(job4.waitForCompletion(true) ? 0 : 1);

The output in stdout printed from the reducer is this:

Key:KeyOne Value: ,some value
Key:KeyTwo Value: ,some other value
Key:KeyOne Value: ,some value
Key:KeyTwo Value: ,some other value
Key:KeyOne Value: ,some value
Key:KeyTwo Value: ,some other value

This makes no sense to me: the keys are the same, so there should be 2 reduce calls, each with 3 values in its Iterable.

Hope you can help me get to the bottom of this :)

Solution

This is most probably because your combiner is running in both the map and reduce phases (a little-known 'feature').

Basically you are amending the key in the combiner, which may or may not run as map outputs are merged together in the reducer. After the combiner is run (reduce side), the keys are fed through the grouping comparator to determine which values back the Iterable passed to the reduce method. (I'm skirting around the streaming aspect of the reduce phase here: the Iterable is not backed by a set or list of values; rather, calls to iterator().hasNext() keep returning true as long as the grouping comparator determines that the current key and the next key are the same.)

You can try to detect which side the combiner is currently running on (map or reduce) by inspecting the Context (there is a Context.getTaskAttemptID().isMap() method, but I have some memory of this being problematic too, and there might even be a JIRA ticket about this somewhere).

Bottom line: don't amend the key in the combiner unless you can find a way to bypass this behaviour when the combiner is running reduce side.

EDIT: Investigating @Amar's comment, I put together some code (pastebin link) which adds in some verbose comparators, combiners, reducers etc. If you run a job with a single map task, then no combiner will run in the reduce phase, and the map output will not be sorted again, as it is assumed to be sorted already.

It is assumed to be sorted because it is sorted prior to being sent into the combiner class, and it is assumed that the keys will come out untouched, hence still sorted. Remember, a Combiner is meant to combine values for a given key.

So with a single map and the given combiner, the reducer sees the keys in KeyOne, KeyTwo, KeyOne, KeyTwo, KeyOne, KeyTwo order. The grouping comparator sees a transition between each adjacent pair, and hence you get 6 calls to the reduce function.
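To make the counting concrete, here is a minimal plain-Java sketch (no Hadoop dependency) of that grouping rule. It assumes the reduce side simply starts a new reduce call whenever the key differs from the previous one in the stream; the class and method names (GroupingDemo, countReduceCalls) are made up for illustration and are not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class GroupingDemo {
    // Counts runs of adjacent equal keys. This mimics (in simplified form) how
    // reduce() calls are delimited when the key stream is assumed to be sorted:
    // a new call starts at every transition between different keys.
    static int countReduceCalls(List<String> keys) {
        int calls = 0;
        String previous = null;
        for (String key : keys) {
            if (!key.equals(previous)) {
                calls++;
            }
            previous = key;
        }
        return calls;
    }

    public static void main(String[] args) {
        // Key order produced when the combiner rewrites keys after sorting.
        List<String> asSeenByReducer = Arrays.asList(
                "KeyOne", "KeyTwo", "KeyOne", "KeyTwo", "KeyOne", "KeyTwo");

        // The same keys if they really were sorted.
        List<String> trulySorted = new ArrayList<>(asSeenByReducer);
        Collections.sort(trulySorted);

        System.out.println(countReduceCalls(asSeenByReducer)); // prints 6
        System.out.println(countReduceCalls(trulySorted));     // prints 2
    }
}
```

The interleaved stream yields six groups of one value each, matching the six lines printed by the reducer in the question, while a genuinely sorted stream yields the two groups the asker expected.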

If you use two mappers, then the reducer knows it has two sorted segments (one from each map), and so still needs to sort them prior to reducing - but because the number of segments is below a threshold, the sort is done as an inline stream sort (again the segments are assumed to be sorted). You still get the wrong output with two mappers (10 records output from the reduce phase).

So again, don't amend the key in the combiner, this is not what the combiner is intended for.
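For contrast, here is a small plain-Java sketch of what a key-preserving combine step looks like. It is only an illustration under simplifying assumptions: records are modelled as Map.Entry pairs rather than Hadoop Text objects, values are "combined" by joining them with commas, and the names (KeyPreservingCombine, combine) are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class KeyPreservingCombine {
    // Merges the values of adjacent records that share a key and emits the
    // result under the SAME key, so a sorted input stays sorted on output.
    static List<Map.Entry<String, String>> combine(List<Map.Entry<String, String>> sortedInput) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, String> record : sortedInput) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).getKey().equals(record.getKey())) {
                // Same key as the previous record: fold the value into it.
                String merged = out.get(last).getValue() + "," + record.getValue();
                out.set(last, new SimpleEntry<>(record.getKey(), merged));
            } else {
                // New key: start a new output record, key unchanged.
                out.add(new SimpleEntry<>(record.getKey(), record.getValue()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = new ArrayList<>();
        input.add(new SimpleEntry<>("KeyOne", "a"));
        input.add(new SimpleEntry<>("KeyOne", "b"));
        input.add(new SimpleEntry<>("KeyTwo", "c"));

        for (Map.Entry<String, String> e : combine(input)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Because the output keys are exactly the input keys, the assumption that combiner output is still sorted holds, and downstream grouping sees each key once. If new keys have to be introduced, doing so in the mapper (before the sort) is the safer place.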
