Creating custom InputFormat and RecordReader for Binary Files in Hadoop MapReduce


Problem Description




I'm writing an M/R job that processes large time-series data files written in a binary format that looks something like this (new lines added here for readability; the actual data is continuous, obviously):

TIMESTAMP_1---------------------TIMESTAMP_1
TIMESTAMP_2**********TIMESTAMP_2 
TIMESTAMP_3%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%TIMESTAMP_3
.. etc

Where timestamp is simply an 8-byte struct, identifiable as such by its first 2 bytes. The actual data is bounded between duplicate timestamp values, as displayed above, and contains one or more predefined structs. I would like to write a custom InputFormat that will emit the following key/value pairs to the mappers:

< TIMESTAMP_1, --------------------- >
< TIMESTAMP_2, ********** >
< TIMESTAMP_3, %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% >
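Ignoring splits for a moment, the extraction itself could be sketched outside Hadoop as a plain scan over a buffer. This is a minimal, illustrative sketch, not the actual file format: `RecordScan` and its assumption that each record is `TIMESTAMP (8 bytes) | payload | same TIMESTAMP (8 bytes)` are hypothetical stand-ins for the real struct layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch (hypothetical layout): extract the payload of each
// record of the form TIMESTAMP (8 bytes) | payload | same TIMESTAMP (8 bytes).
public class RecordScan {
    /** Returns the payload between each pair of duplicate 8-byte timestamps. */
    static List<byte[]> payloads(byte[] buf) {
        List<byte[]> out = new ArrayList<>();
        int pos = 0;
        while (pos + 8 <= buf.length) {
            // remember the opening timestamp of the current record
            byte[] ts = Arrays.copyOfRange(buf, pos, pos + 8);
            int i = pos + 8;
            // scan forward until the opening timestamp repeats
            while (i + 8 <= buf.length
                    && !Arrays.equals(Arrays.copyOfRange(buf, i, i + 8), ts)) {
                i++;
            }
            if (i + 8 > buf.length) break;                 // incomplete trailing record
            out.add(Arrays.copyOfRange(buf, pos + 8, i));  // payload only
            pos = i + 8;                                   // continue after the closing copy
        }
        return out;
    }
}
```

Note this naive scan assumes the payload never contains the timestamp byte sequence; in the real format the identifying first 2 bytes of the timestamp struct would be used to disambiguate.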

Logically, I'd like to keep track of the current TIMESTAMP and aggregate all the data until that TIMESTAMP is detected again, then send out my <TIMESTAMP, DATA> pair as a record. My problem is syncing between splits inside the RecordReader, so that if a certain reader receives a split like the following:

# a split occurs inside my data
reader X: TIMESTAMP_1--------------
reader Y: -------TIMESTAMP_1 TIMESTAMP_2****..

# or inside the timestamp
or even: @@@@@@@TIMES
         TAMP_1-------------- ..

What's a good way to approach this? Is there an easy way to access the file offsets, such that my CustomRecordReader can sync between splits and not lose data? I feel I have some conceptual gaps in how splits are handled, so perhaps an explanation of these may help. Thanks.

Solution

In general it is not simple to create an input format which supports splits, since you need to be able to find out where to move from the split boundary to get consistent records. XmlInputFormat is a good example of a format that does so.
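If you do go splittable, the usual pattern (as in the line-oriented readers) is for each reader to skip forward from its split's start offset to the first record boundary, and to read past its split's end until the next boundary, so every byte is claimed by exactly one reader. A minimal, dependency-free sketch of the boundary scan, assuming (hypothetically) that every 8-byte timestamp begins with the 2-byte marker `0xCA 0xFE`:

```java
// Minimal sketch: find the first timestamp boundary at or after a split's
// start offset. The 2-byte marker 0xCA 0xFE is a hypothetical stand-in for
// the format's real identifying first 2 bytes.
public class SplitSync {
    static final byte M0 = (byte) 0xCA, M1 = (byte) 0xFE;

    /** Offset of the first marker at or after 'start', or -1 if none. */
    static int nextTimestampOffset(byte[] data, int start) {
        for (int i = start; i + 2 <= data.length; i++) {
            if (data[i] == M0 && data[i + 1] == M1) {
                return i;
            }
        }
        return -1;
    }
}
```

Since a 2-byte marker can collide with payload bytes, a real reader should validate the candidate boundary (e.g. check the full 8-byte struct and the duplicate-timestamp framing) before accepting it.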
I would suggest first considering whether you indeed need splittable input. You can define your input format as not splittable and avoid all these issues.
If your files are generally not much larger than the block size, you lose nothing. If they are larger, you will lose part of the data locality.
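Marking the format non-splittable is a one-method override against the Hadoop `mapreduce` (v2) API; a sketch, where `TimestampRecordReader` is a hypothetical reader implementing the <TIMESTAMP, DATA> framing described above:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class TimeSeriesInputFormat
        extends FileInputFormat<BytesWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // each file becomes exactly one split; no sync needed
    }

    @Override
    public RecordReader<BytesWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new TimestampRecordReader();  // hypothetical, not shown here
    }
}
```

With `isSplitable` returning false, the RecordReader always starts at offset 0 of the file, so the split-boundary problem disappears entirely.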

