Custom Binary Input - Hadoop

Views: 202 | Posted: 2018/6/1 12:44:45 | Tags: java, hadoop, mapreduce


Problem description

 
I am developing a demo application in Hadoop, and my input is .mrc image files. I want to load them into Hadoop and run some image processing on them.

These are binary files that contain a large header with metadata, followed by the data for a set of images. The header also describes how to read the images (e.g. number_of_images, number_of_pixels_x, number_of_pixels_y, bytes_per_pixel), so after the header bytes, the first [number_of_pixels_x*number_of_pixels_y*bytes_per_pixel] bytes are the first image, the next block of the same size is the second image, and so on.
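
The layout described above comes down to simple offset arithmetic. Below is a minimal sketch of it in plain Java. The 1024-byte header size, the little-endian byte order, and the positions of the four fields are assumptions made for illustration, not details taken from the question; the class name `MrcLayout` is hypothetical as well.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical layout: four little-endian int32 fields at the start of a fixed-size header. */
final class MrcLayout {
    static final int HEADER_SIZE = 1024; // assumption, not taken from the question

    final int numberOfImages;
    final int pixelsX;
    final int pixelsY;
    final int bytesPerPixel;

    MrcLayout(int numberOfImages, int pixelsX, int pixelsY, int bytesPerPixel) {
        this.numberOfImages = numberOfImages;
        this.pixelsX = pixelsX;
        this.pixelsY = pixelsY;
        this.bytesPerPixel = bytesPerPixel;
    }

    /** Reads the four fields from the assumed positions 0, 4, 8 and 12. */
    static MrcLayout parse(byte[] file) {
        ByteBuffer header = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        return new MrcLayout(header.getInt(0), header.getInt(4), header.getInt(8), header.getInt(12));
    }

    /** Size of one image in bytes. */
    long imageSize() {
        return (long) pixelsX * pixelsY * bytesPerPixel;
    }

    /** Byte offset of image `index` (zero-based) from the start of the file. */
    long imageOffset(int index) {
        if (index < 0 || index >= numberOfImages) {
            throw new IndexOutOfBoundsException("image " + index);
        }
        return HEADER_SIZE + index * imageSize();
    }

    /** Copies the raw bytes of one image out of an in-memory file. */
    byte[] readImage(byte[] file, int index) {
        int start = Math.toIntExact(imageOffset(index));
        int length = Math.toIntExact(imageSize());
        return Arrays.copyOfRange(file, start, start + length);
    }
}
```

Because every image has the same size, any image can be located without scanning the ones before it, which is what makes a per-image split feasible.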

What is a good input format for these kinds of files? I thought of two possible solutions:

1. Convert them to sequence files by placing the metadata in the sequence file header and storing key/value pairs for each image. In this case, can I access the metadata from all mappers?
2. Write a custom InputFormat and RecordReader and create a split for each image, while placing the metadata in the distributed cache.
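
For the second option, the core of a custom InputFormat would be deciding which byte range each split covers. The sketch below shows only that arithmetic in plain Java, without the Hadoop classes. `ImageSplits` and `ByteRange` are hypothetical names; `ByteRange` stands in for the start and length that a Hadoop `FileSplit` carries. The header size and image size are assumed to have been read from the header beforehand.

```java
import java.util.ArrayList;
import java.util.List;

final class ImageSplits {
    /** Plain stand-in for the (start, length) pair carried by a Hadoop FileSplit. */
    static final class ByteRange {
        final long start;
        final long length;

        ByteRange(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Returns one byte range per image, skipping the header. */
    static List<ByteRange> perImage(long headerSize, int numberOfImages, long imageSize, long fileLength) {
        long needed = headerSize + numberOfImages * imageSize;
        if (fileLength < needed) {
            throw new IllegalArgumentException("file has " + fileLength + " bytes, header implies " + needed);
        }
        List<ByteRange> ranges = new ArrayList<>();
        for (int i = 0; i < numberOfImages; i++) {
            ranges.add(new ByteRange(headerSize + i * imageSize, imageSize));
        }
        return ranges;
    }
}
```

A mapper that receives one of these ranges still needs pixels_x, pixels_y and bytes_per_pixel to interpret the bytes, which is why the question proposes shipping the metadata separately.
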
I am new to Hadoop, so I may be missing something. Which approach do you think is better? Is there any other way that I am missing?
Solution
Without knowing your file format in detail, the first option seems to be the better one. With sequence files you can leverage a lot of SequenceFile-related tooling to get better performance. However, two things concern me about this approach:

1. How will you get your .mrc files into a .seq format?
2. You mentioned that the header is large; this may reduce the performance of SequenceFiles.

Even with those concerns, I think that representing your data in SequenceFiles is the best option.
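
On the first concern, one possible shape for the conversion step is a small program that streams the .mrc file once and emits one (index, bytes) record per image. The sketch below is plain Java and does not use the Hadoop API. `MrcToRecords` and the `sink` callback are hypothetical, and the header size, image count and image size are assumed to be known already. In a real converter the sink would presumably call `append` on a `SequenceFile.Writer` with, for example, an `IntWritable` key and a `BytesWritable` value.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.function.BiConsumer;

final class MrcToRecords {
    /**
     * Skips the header, then hands each image to the sink as (index, raw bytes).
     * Returns the number of records emitted.
     */
    static int convert(InputStream in, int headerSize, int numberOfImages, int imageSize,
                       BiConsumer<Integer, byte[]> sink) {
        try {
            DataInputStream data = new DataInputStream(in);
            data.readFully(new byte[headerSize]); // header consumed; metadata handling is omitted here
            for (int i = 0; i < numberOfImages; i++) {
                byte[] image = new byte[imageSize];
                data.readFully(image); // fails with EOFException if the file is truncated
                sink.accept(i, image);
            }
            return numberOfImages;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only one image is held in memory at a time, so the size of the input file does not matter for the converter itself.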
