Mapper input Key-Value pair in Hadoop


Problem description



Normally, we write the mapper in the form:

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>

Here the input key-value pair for the mapper is <LongWritable, Text>. As far as I know, when the mapper gets the input data it goes through it line by line, so the key for the mapper signifies the line number. Please correct me if I am wrong.

My question is: if I give the input key-value pair for the mapper as <Text, Text>, then it gives the error

 java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

Is it mandatory to give the input key-value pair of the mapper as <LongWritable, Text>? If yes, then why? If not, what is the reason for the error? Can you please help me understand the proper reasoning behind the error?

Thanks in advance.

Solution

The input to the mapper depends on what InputFormat is used. The InputFormat is responsible for reading the incoming data and shaping it into whatever format the Mapper expects. The default InputFormat is TextInputFormat, which extends FileInputFormat<LongWritable, Text>: the key it produces is the byte offset of the line within the file (strictly speaking, not the line number), and the value is the contents of the line.
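
For reference, here is a minimal sketch of a mapper whose input types line up with what TextInputFormat produces (the class and variable names are just illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input types match TextInputFormat: the key is the byte offset of the
// line within the file, the value is the line itself.
public class OffsetLineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(line, ONE); // emit (line, 1), word-count style
    }
}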

If you do not change the InputFormat, using a Mapper with a key-value type signature other than <LongWritable, Text> causes exactly this error. If you expect <Text, Text> input, you will have to choose an appropriate InputFormat. You can set the InputFormat in the Job setup:

job.setInputFormatClass(MyInputFormat.class);

And like I said, by default this is set to TextInputFormat.
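
For context, here is a sketch of where that call sits in an otherwise ordinary driver; MyMapper and MyInputFormat are stand-ins for your own classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "text-text job");
        job.setJarByClass(Driver.class);
        job.setMapperClass(MyMapper.class);           // a Mapper<Text, Text, ...>
        job.setInputFormatClass(MyInputFormat.class); // must produce <Text, Text> pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}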

Now, let's say your input data is a bunch of newline-separated records, each with a comma separating the key from the value:

  • "A,value1"
  • "B,value2"

If you want the input key-value pairs for the mapper to be ("A", "value1") and ("B", "value2"), you will have to implement a custom InputFormat and RecordReader with the <Text, Text> signature. Fortunately, this is pretty easy, and there are a few examples of it floating around StackOverflow.

In short, add a class which extends FileInputFormat<Text, Text> and a class which extends RecordReader<Text, Text>. Override the FileInputFormat#createRecordReader method and have it return an instance of your custom RecordReader, as in the sketch below.
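
Under those assumptions the InputFormat shell is tiny; a sketch with made-up class names (CommaInputFormat, CommaRecordReader), where the reader itself is sketched after the next paragraph:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Declaring <Text, Text> is what lets the framework hand the mapper
// Text keys instead of LongWritable offsets.
public class CommaInputFormat extends FileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new CommaRecordReader();
    }
}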

Then you will have to implement the required RecordReader logic. The simplest way to do this is to create an instance of LineRecordReader inside your custom RecordReader and delegate all the basic responsibilities to that instance. In the getCurrentKey and getCurrentValue methods you then implement the logic for extracting the comma-delimited Text contents, by calling LineRecordReader#getCurrentValue and splitting the line on the comma.
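
A minimal sketch of that delegating reader, splitting each line on its first comma (again, the class name is made up):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Delegates all the real work to LineRecordReader and reinterprets each
// line "A,value1" as key "A" and value "value1".
public class CommaRecordReader extends RecordReader<Text, Text> {
    private final LineRecordReader delegate = new LineRecordReader();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return delegate.nextKeyValue();
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        String line = delegate.getCurrentValue().toString();
        int comma = line.indexOf(',');
        // Everything before the first comma; the whole line if there is none.
        return new Text(comma >= 0 ? line.substring(0, comma) : line);
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        String line = delegate.getCurrentValue().toString();
        int comma = line.indexOf(',');
        // Everything after the first comma; empty if there is none.
        return new Text(comma >= 0 ? line.substring(comma + 1) : "");
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}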

Finally, set your new InputFormat as the Job's InputFormat via job.setInputFormatClass, as shown above.
