Custom Binary Input - Hadoop


Question

I am developing a demo application in Hadoop, and my input is .mrc image files. I want to load them into Hadoop and do some image processing on them.

These are binary files that contain a large header with metadata, followed by the data for a set of images. The information on how to read the images is also contained in the header (e.g. number_of_images, number_of_pixels_x, number_of_pixels_y, bytes_per_pixel). After the header bytes, the first [number_of_pixels_x * number_of_pixels_y * bytes_per_pixel] bytes are the first image, the next block of the same size is the second image, and so on.
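To make the layout above concrete, here is a minimal sketch of the offset arithmetic. The class name `ImageStackLayout`, the `headerBytes` field, and the sample values are assumptions for illustration only; the other field names follow the header fields listed above, and the real header size has to come from the file format itself.

```java
// Hypothetical helper: computes where each image starts in the file,
// given values that would normally be parsed from the header.
final class ImageStackLayout {
    final long headerBytes;     // assumed: size of the header in bytes
    final int numberOfImages;
    final int pixelsX;
    final int pixelsY;
    final int bytesPerPixel;

    ImageStackLayout(long headerBytes, int numberOfImages,
                     int pixelsX, int pixelsY, int bytesPerPixel) {
        this.headerBytes = headerBytes;
        this.numberOfImages = numberOfImages;
        this.pixelsX = pixelsX;
        this.pixelsY = pixelsY;
        this.bytesPerPixel = bytesPerPixel;
    }

    // Size of one image: number_of_pixels_x * number_of_pixels_y * bytes_per_pixel.
    long imageBytes() {
        return (long) pixelsX * pixelsY * bytesPerPixel;
    }

    // Byte offset of image `index` (0-based), counted from the start of the file.
    long imageOffset(int index) {
        if (index < 0 || index >= numberOfImages) {
            throw new IndexOutOfBoundsException("image index " + index);
        }
        return headerBytes + index * imageBytes();
    }

    // Total file size implied by the header fields.
    long expectedFileBytes() {
        return headerBytes + numberOfImages * imageBytes();
    }

    public static void main(String[] args) {
        // Illustrative values only, not taken from a real file.
        ImageStackLayout layout = new ImageStackLayout(1024, 3, 4, 2, 2);
        for (int i = 0; i < layout.numberOfImages; i++) {
            System.out.println("image " + i + " starts at byte " + layout.imageOffset(i)
                    + " and is " + layout.imageBytes() + " bytes long");
        }
    }
}
```

Each (offset, length) pair computed this way is the byte range a per-image split or record would have to cover.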

What is a good input format for these kinds of files? I thought of two possible solutions:

  1. Convert them to sequence files by placing the metadata in the sequence file header and storing one key/value pair per image. In this case, can I access the metadata from all mappers?
  2. Write a custom InputFormat and RecordReader that create a split for each image, and place the metadata in the distributed cache.

I am new to Hadoop, so I may be missing something. Which approach do you think is better? Is there any other way that I am missing?

Solution

Without knowing more about your file format, the first option seems to be the better one. With sequence files you can leverage the existing SequenceFile-related tooling, which usually gives better performance. However, two things concern me about this approach:

  1. How will you get your .mrc files into the .seq format?
  2. You mentioned that the header is large; this may reduce the performance of the SequenceFiles.
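For the first concern, the format-specific part of a converter is small. The sketch below uses only the JDK and shows one way to cut a file into one record per image; it is not a complete converter. The class name `ImageRecordExtractor` and its parameters are hypothetical, and the header fields are assumed to have been parsed already. In an actual conversion, each returned array would be appended to a `SequenceFile.Writer` together with its index, for example as an `IntWritable` key and a `BytesWritable` value.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a header-plus-images stream into one byte[] per image.
final class ImageRecordExtractor {

    // Reads and discards the header, then reads `numberOfImages` blocks of `imageBytes` each.
    static List<byte[]> readImages(InputStream in, int headerBytes,
                                   int numberOfImages, int imageBytes) throws IOException {
        DataInputStream data = new DataInputStream(in);
        // readFully throws EOFException if the stream is shorter than expected.
        data.readFully(new byte[headerBytes]);
        List<byte[]> images = new ArrayList<>(numberOfImages);
        for (int i = 0; i < numberOfImages; i++) {
            byte[] image = new byte[imageBytes];
            data.readFully(image);
            images.add(image);
        }
        return images;
    }

    // Convenience overload for in-memory data; wraps the checked exception.
    static List<byte[]> readImages(byte[] file, int headerBytes,
                                   int numberOfImages, int imageBytes) {
        try {
            return readImages(new ByteArrayInputStream(file), headerBytes,
                              numberOfImages, imageBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Toy input: a 4-byte "header" followed by two 3-byte "images".
        byte[] file = {9, 9, 9, 9, 1, 2, 3, 4, 5, 6};
        List<byte[]> images = readImages(file, 4, 2, 3);
        for (int i = 0; i < images.size(); i++) {
            System.out.println("record " + i + ": " + images.get(i).length + " bytes");
        }
    }
}
```

Because every image becomes its own record, the large header does not have to be repeated per record; how the metadata is best made available to the mappers remains a separate design decision.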

But even with those concerns, I think that representing your data in SequenceFiles is the best option.
