Get line number in map method using FileInputFormat


Question

I was wondering whether it is possible to get the line number in my map method. My input file is just a single column of values, like:


Apple
Orange
Banana

Is it possible to get key: 1, value: Apple; key: 2, value: Orange; ... in my map method?

I am using CDH3/CDH4. Changing the input data so that I can use KeyValueInputFormat is not an option. Thanks in advance.

Recommended answer

The default behaviour of InputFormats such as TextInputFormat is to give the byte offset of each record as the key rather than the actual line number. This is mainly because the true line number cannot be determined when an input file is splittable and is being processed by two or more mappers: a mapper that starts partway through the file has no way of knowing how many lines precede its split.
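To make the difference concrete, here is a small plain-Java sketch with no Hadoop dependencies. It computes the byte-offset keys for the sample input from the question and shows that the line number of a record can only be found by counting every newline before it. The class and method names are hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustration only: mimics the keys a byte-offset based reader would emit.
class OffsetVersusLineNumber {

    /** Returns the byte offset at which each '\n'-terminated line starts. */
    static List<Long> lineStartOffsets(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        long start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                offsets.add(start);
                start = i + 1;
            }
        }
        if (start < data.length) {
            offsets.add(start); // final line without a trailing newline
        }
        return offsets;
    }

    /**
     * Counts the newlines in [0, splitStart). A reader whose split begins at
     * splitStart would have to scan all of these bytes to learn its first
     * line number, which defeats the purpose of splitting the file.
     */
    static long linesBefore(byte[] data, int splitStart) {
        long count = 0;
        for (int i = 0; i < splitStart && i < data.length; i++) {
            if (data[i] == '\n') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "Apple\nOrange\nBanana\n".getBytes(StandardCharsets.UTF_8);
        List<Long> offsets = lineStartOffsets(data);
        for (int i = 0; i < offsets.size(); i++) {
            System.out.println("byte offset key = " + offsets.get(i)
                    + ", line number = " + (i + 1));
        }
    }
}
```

For the three-line sample, the offset keys are 0, 6 and 13, while the line numbers the question asks for are 1, 2 and 3.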

You could create your own InputFormat (based on TextInputFormat and its associated LineRecordReader) that produces line numbers rather than byte offsets, but you would need to configure it to return false from the isSplitable method, meaning that a large input file would not be processed by multiple mappers. If you have small files, or files close in size to the HDFS block size, this shouldn't be a problem. Also, non-splittable compression formats (GZip .gz, for example) mean the entire file is processed by a single mapper anyway.
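A minimal sketch of such an input format follows. It assumes the newer `org.apache.hadoop.mapreduce` API and delegates the actual line reading to `LineRecordReader`. The class names `LineNumberInputFormat` and `LineNumberRecordReader` are hypothetical, and the code has not been tested against CDH3 or CDH4, so treat it as a starting point rather than a finished implementation.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical input format: key = 1-based line number, value = line text.
public class LineNumberInputFormat extends FileInputFormat<LongWritable, Text> {

    // Never split a file, so one reader sees every line from the start.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new LineNumberRecordReader();
    }

    // Wraps LineRecordReader and replaces its byte-offset key with a counter.
    public static class LineNumberRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final LongWritable lineNumber = new LongWritable(0);

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (!delegate.nextKeyValue()) {
                return false;
            }
            lineNumber.set(lineNumber.get() + 1); // first line gets key 1
            return true;
        }

        @Override
        public LongWritable getCurrentKey() {
            return lineNumber;
        }

        @Override
        public Text getCurrentValue() {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}
```

The job would then select it with `job.setInputFormatClass(LineNumberInputFormat.class)`, and the mapper would declare `LongWritable` keys and `Text` values as usual. Note that the counter restarts at 1 for each input file, since every file gets its own record reader.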
