Mapper input Key-Value pair in Hadoop


Problem description

Normally, we write the mapper in the form:

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>

Here the input key-value pair for the mapper is <LongWritable, Text> - as far as I know, when the mapper gets the input data it goes through it line by line - so the key for the mapper signifies the line number - please correct me if I am wrong.

My question is: if I give the input key-value pair for the mapper as <Text, Text>, then it gives the error

 java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

Is it mandatory to give the input key-value pair of the mapper as <LongWritable, Text>? If yes, then why? If not, then what is the reason for the error? Can you please help me understand the proper reasoning behind the error?

Thanks in advance.

Recommended answer

The input to the mapper depends on what InputFormat is used. The InputFormat is responsible for reading the incoming data and shaping it into whatever format the Mapper expects. The default InputFormat is TextInputFormat, which extends FileInputFormat<LongWritable, Text>. (With TextInputFormat, the LongWritable key is actually the byte offset of each line within the file, not its line number.)
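
For illustration, here is a minimal mapper sketch matching those default types; the class name LineOffsetMapper and its body are hypothetical:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical example: the input types <LongWritable, Text> match what
    // TextInputFormat produces - the byte offset of each line and the line itself.
    public class LineOffsetMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // offset is where the line starts in the file, not a line number.
            context.write(new Text(line.toString().trim()), ONE);
        }
    }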

If you do not change the InputFormat, using a Mapper with a Key-Value type signature different from <LongWritable, Text> will cause this error, because the framework hands the Mapper a LongWritable key that cannot be cast to Text. If you expect <Text, Text> input, you will have to choose an appropriate InputFormat. You can set the InputFormat in the Job setup:

job.setInputFormatClass(MyInputFormat.class);

And like I said, by default this is set to TextInputFormat.

Now, let's say your input data is a bunch of newline-separated records, each delimited by a comma:

  • "A,value1"
  • B,value2"

If you want the input keys and values for the mapper to be ("A", "value1") and ("B", "value2"), you will have to implement a custom InputFormat and RecordReader with the <Text, Text> signature. Fortunately, this is pretty easy. There is an example here, and probably a few more floating around StackOverflow as well.

In short, add a class which extends FileInputFormat<Text, Text> and a class which extends RecordReader<Text, Text>. Override the FileInputFormat#createRecordReader method (getRecordReader in the old mapred API) and have it return an instance of your custom RecordReader.

Then you will have to implement the required RecordReader logic. The simplest way to do this is to create an instance of LineRecordReader inside your custom RecordReader and delegate all the basic responsibilities to it. In the getCurrentKey and getCurrentValue methods, you will implement the logic for extracting the comma-delimited Text contents by calling LineRecordReader#getCurrentValue and splitting it on the comma, as in the sketch below.
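
Below is a minimal sketch of this delegating approach, assuming the new org.apache.hadoop.mapreduce API; the class names CommaTextInputFormat and CommaTextRecordReader are made up for illustration, and the comma split is done once per record in nextKeyValue rather than in the getters:

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Hypothetical names, for illustration only.
    public class CommaTextInputFormat extends FileInputFormat<Text, Text> {

        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split,
                TaskAttemptContext context) {
            return new CommaTextRecordReader();
        }

        public static class CommaTextRecordReader extends RecordReader<Text, Text> {

            // Delegate the actual file reading to the stock LineRecordReader.
            private final LineRecordReader lineReader = new LineRecordReader();
            private final Text key = new Text();
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException {
                lineReader.initialize(split, context);
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                if (!lineReader.nextKeyValue()) {
                    return false;
                }
                // Split each line, e.g. "A,value1", on the first comma.
                String line = lineReader.getCurrentValue().toString();
                int comma = line.indexOf(',');
                if (comma < 0) {
                    key.set(line);      // no comma: whole line becomes the key
                    value.set("");
                } else {
                    key.set(line.substring(0, comma));
                    value.set(line.substring(comma + 1));
                }
                return true;
            }

            @Override
            public Text getCurrentKey() {
                return key;
            }

            @Override
            public Text getCurrentValue() {
                return value;
            }

            @Override
            public float getProgress() throws IOException, InterruptedException {
                return lineReader.getProgress();
            }

            @Override
            public void close() throws IOException {
                lineReader.close();
            }
        }
    }

You would then register it in the driver with job.setInputFormatClass(CommaTextInputFormat.class), just like the earlier example.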

Finally, set your new InputFormat as the Job's InputFormat, as shown after the second paragraph above.

