Hadoop read multiple lines at a time


Problem Description



I have a file in which every set of four lines represents a record.

e.g., the first four lines represent record 1, the next four represent record 2, and so on.

How can I ensure that the Mapper receives these four lines at a time?

Also, I want file splitting in Hadoop to happen at record boundaries (the line number should be a multiple of four), so that records don't span multiple splits.

How can this be done?

Solution

A few approaches, some dirtier than others:


The right way

You may have to define your own RecordReader, InputSplit, and InputFormat. Depending on exactly what you are trying to do, you will be able to reuse some of the existing implementations of the three above. You will likely have to write your own RecordReader to define the key/value pairs, and you will likely have to write your own InputSplit to help define the boundary.
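
For a concrete picture, here is a minimal sketch using the org.apache.hadoop.mapreduce API. The class names FourLineInputFormat and FourLineRecordReader are invented for this example; it wraps the stock LineRecordReader to glue four lines into one value, and it sidesteps the boundary problem crudely by marking files as non-splittable (one map task per file):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical class names, invented for this sketch.
public class FourLineInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new FourLineRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Crude way to keep records from spanning splits: one split per file.
        return false;
    }

    public static class FourLineRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader lines = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            lines.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            StringBuilder record = new StringBuilder();
            int read = 0;
            while (read < 4 && lines.nextKeyValue()) {
                if (read == 0) {
                    key.set(lines.getCurrentKey().get()); // byte offset of the first line
                } else {
                    record.append(';');
                }
                record.append(lines.getCurrentValue().toString());
                read++;
            }
            if (read == 0) {
                return false; // no lines left
            }
            value.set(record.toString()); // "line1;line2;line3;line4"
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException { return lines.getProgress(); }

        @Override
        public void close() throws IOException { lines.close(); }
    }
}

Returning false from isSplitable trades parallelism for correctness; computing real split boundaries at multiples of four lines is the harder part this sketch skips.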


Another right way, which may not be possible

The above task is quite daunting. Do you have any control over your data set? Can you preprocess it in some way (either while it is coming in or at rest)? If so, you should strongly consider trying to transform your dataset into something that is easier to read out of the box in Hadoop; a sketch of such a preprocessor follows the diagram below.

Something like:

ALine1
ALine2            ALine1;ALine2;ALine3;ALine4
ALine3
ALine4        ->
BLine1
BLine2            BLine1;BLine2;BLine3;BLine4
BLine3
BLine4
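
If preprocessing is an option, even a trivial one-off join step produces that layout. A minimal sketch, assuming a semicolon delimiter; the class name FourLineJoiner is invented for this example:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical one-off preprocessor: collapse each group of four lines into one.
public class FourLineJoiner {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
            List<String> group = new ArrayList<>(4);
            String line;
            while ((line = in.readLine()) != null) {
                group.add(line);
                if (group.size() == 4) {
                    out.println(String.join(";", group)); // one record per output line
                    group.clear();
                }
            }
            if (!group.isEmpty()) {
                out.println(String.join(";", group)); // trailing partial record, if any
            }
        }
    }
}

Run as java FourLineJoiner input.txt output.txt; the output then works with the stock TextInputFormat, one record per line.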


Down and Dirty

Do you have any control over the file sizes of your data? If you manually split your data on the block boundary, you can force Hadoop to not care about records spanning splits. For example, if your block size is 64MB, write your files out in 60MB chunks.
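
As a rough sketch of what writing out 60MB chunks could look like while still cutting only on record boundaries (the class name, the 60MB constant, and the byte accounting are all assumptions of this example; lengths are approximated as one byte per character plus a newline):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical chunk writer: rolls to a new output file at roughly 60MB, but
// only after a complete 4-line record, so no record straddles a 64MB block.
public class RecordAlignedChunkWriter {
    private static final long CHUNK_BYTES = 60L * 1024 * 1024; // stay under a 64MB block

    public static void main(String[] args) throws IOException {
        int part = 0;
        long written = 0;
        PrintWriter out = new PrintWriter(new FileWriter(args[1] + ".part" + part));
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            int lineInRecord = 0;
            while ((line = in.readLine()) != null) {
                out.println(line);
                written += line.length() + 1; // approximate: one byte per char plus '\n'
                lineInRecord = (lineInRecord + 1) % 4;
                if (lineInRecord == 0 && written >= CHUNK_BYTES) {
                    out.close();
                    part++;
                    written = 0;
                    out = new PrintWriter(new FileWriter(args[1] + ".part" + part));
                }
            }
        } finally {
            out.close();
        }
    }
}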

Without worrying about input splits, you could do something dirty: In your map function, add your new key/value pair into a list object. If the list object has 4 items in it, do processing, emit something, then clean out the list. Otherwise, don't emit anything and move on without doing anything.
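
That dirty buffering trick might look like the following mapper, a sketch assuming the stock TextInputFormat feeds it one line per call; FourLineBufferMapper is a made-up name:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: buffers lines and only emits once four have arrived.
public class FourLineBufferMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<String> buffer = new ArrayList<>(4);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        buffer.add(value.toString());
        if (buffer.size() == 4) {
            // Full record in hand: key it by its first line and emit.
            context.write(new Text(buffer.get(0)), new Text(String.join(";", buffer)));
            buffer.clear();
        }
        // Fewer than four lines buffered: emit nothing and wait.
    }
}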

The reason you have to manually split the data is that you are otherwise not guaranteed that an entire 4-line record will be given to the same map task.
