Why do we need to set the output key/value class explicitly in the Hadoop program?


Problem description

The book "Hadoop: The Definitive Guide" contains a sample program with the following code.

JobConf conf = new JobConf(MaxTemperature.class);  
conf.setJobName("Max temperature");  
FileInputFormat.addInputPath(conf, new Path(args[0]));  
FileOutputFormat.setOutputPath(conf, new Path(args[1]));  
conf.setMapperClass(MaxTemperatureMapper.class);  
conf.setReducerClass(MaxTemperatureReducer.class);  
conf.setOutputKeyClass(Text.class);  
conf.setOutputValueClass(IntWritable.class);  

The MR framework should be able to figure out the output key and value classes from the Mapper and Reducer classes that are set on the JobConf. Why do we need to set the output key and value classes explicitly on the JobConf? Also, there is no similar API for the input key/value pair.

Recommended answer

The reason is type erasure [1]. The output K/V classes are declared as generic type parameters. Job setup happens at run time, not compile time, and by then these generic types have been erased.
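To illustrate the erasure point, here is a minimal Java sketch. `Emitter` is a hypothetical stand-in for a class with generic output types and is not part of Hadoop. The sketch only shows that an instance carries no record of the type arguments it was created with, so code that receives such an object at run time cannot recover them from it.

```java
import java.lang.reflect.TypeVariable;

class ErasureDemo {
    // Hypothetical generic class, standing in for something with key/value type parameters.
    static class Emitter<K, V> { }

    /** Two instances with different type arguments share one runtime class. */
    public static boolean sameRuntimeClass() {
        Emitter<String, Integer> a = new Emitter<>();
        Emitter<Long, Double> b = new Emitter<>();
        return a.getClass() == b.getClass();
    }

    /** Reflection on the class yields only the declared parameter names, not actual types. */
    public static String declaredTypeParameterNames() {
        StringBuilder names = new StringBuilder();
        for (TypeVariable<?> tv : Emitter.class.getTypeParameters()) {
            if (names.length() > 0) {
                names.append(',');
            }
            names.append(tv.getName());
        }
        return names.toString();
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());           // true
        System.out.println(declaredTypeParameterNames()); // K,V
    }
}
```

Passing a `Class` object explicitly, as `setOutputKeyClass(Text.class)` does in the question's code, is the usual way to hand such type information to code that runs later.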

The input K/V classes can be read from the input file. In the case of SequenceFiles, the classes are named in the header; you can see them when you open a sequence file in an editor. This header must be written, and since every map output is a SequenceFile, you need to provide the classes.
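The header idea can be sketched with a toy format. The code below is an assumption-laden illustration, not Hadoop's actual SequenceFile layout: `HeaderSketch`, `writeHeader`, and `readHeader` are hypothetical names, and the "header" is just two class names written as strings. It shows why the writer must be told the classes up front, while a reader can recover them from the file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class HeaderSketch {
    /** Writes a toy header: the writer has to be given both classes explicitly. */
    public static byte[] writeHeader(Class<?> keyClass, Class<?> valueClass) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(keyClass.getName());
            out.writeUTF(valueClass.getName());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the toy header back: the reader learns the classes from the data itself. */
    public static Class<?>[] readHeader(byte[] header) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(header))) {
            Class<?> keyClass = Class.forName(in.readUTF());
            Class<?> valueClass = Class.forName(in.readUTF());
            return new Class<?>[] { keyClass, valueClass };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class named in header is not on the classpath", e);
        }
    }

    public static void main(String[] args) {
        // String and Integer are placeholders for real key/value classes.
        byte[] header = writeHeader(String.class, Integer.class);
        Class<?>[] classes = readHeader(header);
        System.out.println(classes[0].getName() + " / " + classes[1].getName());
    }
}
```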

[1] http://download.oracle.com/javase/tutorial/java/generics/erasure.html
