Custom Binary Input - Hadoop
Problem description
I am developing a demo application in Hadoop, and my input is a set of .mrc image files. I want to load them into Hadoop and run some image processing on them.
These are binary files that contain a large header with metadata, followed by the data for a set of images. The information on how to read the images is also contained in the header (e.g. number_of_images, number_of_pixels_x, number_of_pixels_y, bytes_per_pixel), so after the header bytes, the first [number_of_pixels_x * number_of_pixels_y * bytes_per_pixel] bytes are the first image, then comes the second, and so on.
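To make the layout concrete, here is a minimal Java sketch of the arithmetic described above. It assumes a toy header in which the four fields are stored as little-endian 32-bit integers at the very start of the file; that layout, the class name `ImageStackHeader`, and its methods are hypothetical and are not taken from the actual .mrc specification.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical header: four little-endian int32 fields at the start of the file. */
final class ImageStackHeader {
    final int numberOfImages;
    final int pixelsX;
    final int pixelsY;
    final int bytesPerPixel;

    ImageStackHeader(int numberOfImages, int pixelsX, int pixelsY, int bytesPerPixel) {
        this.numberOfImages = numberOfImages;
        this.pixelsX = pixelsX;
        this.pixelsY = pixelsY;
        this.bytesPerPixel = bytesPerPixel;
    }

    /** Parses the four assumed fields from the first 16 bytes of the header. */
    static ImageStackHeader parse(byte[] headerBytes) {
        ByteBuffer buf = ByteBuffer.wrap(headerBytes).order(ByteOrder.LITTLE_ENDIAN);
        return new ImageStackHeader(buf.getInt(), buf.getInt(), buf.getInt(), buf.getInt());
    }

    /** Size of one image: number_of_pixels_x * number_of_pixels_y * bytes_per_pixel. */
    long imageSizeBytes() {
        return (long) pixelsX * pixelsY * bytesPerPixel;
    }

    /** Byte offset of image i, given the total header length in bytes. */
    long imageOffset(long headerLength, int imageIndex) {
        if (imageIndex < 0 || imageIndex >= numberOfImages) {
            throw new IndexOutOfBoundsException("image index " + imageIndex);
        }
        return headerLength + imageIndex * imageSizeBytes();
    }
}
```

The header length is passed in separately because the real header is larger than these four fields; only the offset formula matters here.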
What is a good input format for these kinds of files? I thought of two possible solutions:
- Convert them to sequence files by placing the metadata in the sequence file header and storing a key/value pair for each image. In this case, can I access the metadata from all mappers?
- Write a custom InputFormat and RecordReader and create a split for each image, while placing the metadata in the distributed cache.
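For the second option, the core of a custom InputFormat would be deciding which byte ranges become splits. The sketch below shows only that planning step in plain Java, with no Hadoop dependency. The names `ImageSplitPlanner`, `Range`, and the `imagesPerSplit` parameter are hypothetical; grouping several images into one split is an assumption added here, since one split per image may produce a very large number of tiny tasks.

```java
import java.util.ArrayList;
import java.util.List;

final class ImageSplitPlanner {
    /** One planned split: a byte range that covers only whole images. */
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Plans byte ranges after the header so that no image is cut in half.
     * imagesPerSplit = 1 gives one split per image, as in the question.
     */
    static List<Range> plan(long headerLength, int imageCount,
                            long imageSizeBytes, int imagesPerSplit) {
        if (imagesPerSplit <= 0) {
            throw new IllegalArgumentException("imagesPerSplit must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        for (int first = 0; first < imageCount; first += imagesPerSplit) {
            // The last range may hold fewer images than the others.
            int count = Math.min(imagesPerSplit, imageCount - first);
            ranges.add(new Range(headerLength + first * imageSizeBytes,
                                 count * imageSizeBytes));
        }
        return ranges;
    }
}
```

In a real job each range would presumably be wrapped in a Hadoop `FileSplit` inside the InputFormat's `getSplits`, and the RecordReader would still need the header values (for example from the distributed cache) to interpret the bytes in its range.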
I am new to Hadoop, so I may be missing something. Which approach do you think is better? Is there any other way that I am missing?
Without knowing your file format, the first option seems to be the better one. With sequence files you can leverage a lot of SequenceFile-related tools to get better performance. However, there are two things that concern me with this approach:
- How will you get your .mrc files into a .seq format?
- You mentioned that the header is large; this may reduce the performance of SequenceFiles.
But even with those concerns, I think that representing your data in SequenceFiles is the best option.
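On the first concern, the conversion step could look roughly like the Java sketch below: skip the header, then read one fixed-size image at a time and hand each one to a writer as an (index, bytes) record. The `RecordSink` interface is a hypothetical stand-in for whatever actually writes the records, and the class and parameter names are assumptions. In Hadoop the sink would presumably wrap a `SequenceFile.Writer` and call its `append` method with, say, an `IntWritable` key and a `BytesWritable` value.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ImageRecordExtractor {
    /** Hypothetical stand-in for the record writer, e.g. a SequenceFile writer. */
    interface RecordSink {
        void append(int imageIndex, byte[] imageBytes) throws IOException;
    }

    /**
     * Streams a header-plus-images file into per-image records.
     * Returns the number of records written.
     */
    static int extract(InputStream in, int headerLength, int imageCount,
                       int imageSizeBytes, RecordSink sink) throws IOException {
        DataInputStream data = new DataInputStream(in);
        // Consume the header; readFully throws EOFException if the file is truncated.
        data.readFully(new byte[headerLength]);
        for (int i = 0; i < imageCount; i++) {
            byte[] image = new byte[imageSizeBytes];
            data.readFully(image);
            sink.append(i, image);
        }
        return imageCount;
    }
}
```

Because only one image is held in memory at a time, this kind of converter can run as a one-off preprocessing step before the images are loaded into HDFS.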