Hadoop: how to access (many) photo images to be processed by map/reduce?

Problem Description

I have 10M+ photos saved on the local file system. Now I want to go through each of them and analyze the photo's binary data to see if it's a dog. I basically want to do the analysis on a clustered Hadoop environment. The problem is: how should I design the input for the map method? Let's say, in the map method, new FaceDetection(photoInputStream).isDog() is all the underlying logic for the analysis.

Specifically, should I upload all of the photos to HDFS? Assuming yes,

  1. How can I use them in the map method?

  2. Is it OK to make the input (to the map) a text file containing all of the photo paths (in HDFS), one per line, and in the map method load the binary like: photoInputStream = getImageFromHDFS(photopath); (Actually, what is the right way to load a file from HDFS during the execution of the map method?)

It seems I'm missing some knowledge about the basic principles of Hadoop, MapReduce and HDFS, but could you please point me in the right direction regarding the above questions? Thanks!

Solution

How can I use them in the map method?

The major problem is that each image is going to be in its own file. So if you have 10M files, you'll have 10M mappers, which doesn't sound terribly reasonable. You may want to consider pre-serializing the files into SequenceFiles (one image per key-value pair). This will make loading the data into the MapReduce job native, so you don't have to write any tricky code. Also, you'll be able to store all of your data in one SequenceFile, if you so desire. Hadoop handles splitting SequenceFiles quite well.
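
To make that concrete, here is a minimal sketch of a mapper that consumes such a SequenceFile of (file name, raw image bytes) pairs. The class name, key/value layout, and output are illustrative assumptions, and FaceDetection stands in for the questioner's own detection class:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumed layout: key = original file name, value = raw image bytes.
public class DogDetectMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(Text fileName, BytesWritable imageBytes, Context context)
            throws IOException, InterruptedException {
        // BytesWritable's backing array may be padded, so copy only the valid bytes.
        byte[] raw = Arrays.copyOf(imageBytes.getBytes(), imageBytes.getLength());

        // FaceDetection is the questioner's own class, used here as a placeholder.
        if (new FaceDetection(new ByteArrayInputStream(raw)).isDog()) {
            context.write(fileName, ONE);
        }
    }
}

In the driver you would then tell the job to read SequenceFile input, e.g. job.setInputFormatClass(SequenceFileInputFormat.class), using the class from org.apache.hadoop.mapreduce.lib.input.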

Basically, the way this works is: you will have a separate Java process that takes several image files, reads the raw bytes into memory, then stores the data as key-value pairs in a SequenceFile. Keep going, and keep writing into HDFS. This may take a while, but you'll only have to do it once.
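
As a rough sketch of that one-off packing step (the class name and key/value layout are assumptions, and the classic FileSystem-based createWriter overload is used; newer Hadoop versions prefer the Writer.Option variant), a standalone program might look like this:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// One-off packing tool: local image files in, one SequenceFile in HDFS out.
public class ImagesToSequenceFile {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path(args[0]);   // first argument: the HDFS path of the SequenceFile

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, output, Text.class, BytesWritable.class);
        try {
            for (int i = 1; i < args.length; i++) {
                File image = new File(args[i]);
                byte[] raw = Files.readAllBytes(image.toPath());   // raw bytes of the image
                // key = file name, value = raw image bytes
                writer.append(new Text(image.getName()), new BytesWritable(raw));
            }
        } finally {
            writer.close();
        }
    }
}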


Is it OK to make the input (to the map) a text file containing all of the photo paths (in HDFS), one per line, and in the map method load the binary like: photoInputStream = getImageFromHDFS(photopath); (Actually, what is the right way to load a file from HDFS during the execution of the map method?)

This is not OK if you have any sort of reasonable cluster (which you should, if you are considering Hadoop for this) and you actually want to be using the power of Hadoop. Your MapReduce job will fire off and load the files, but the mappers will be running data-local to the text file, not to the images! So, basically, you are going to be shuffling the image files everywhere, since the JobTracker is not placing tasks where the files are. This will incur a significant amount of network overhead. If you have 1TB of images, you can expect that a lot of them will be streamed over the network if you have more than a few nodes. This may not be so bad depending on your situation and cluster size (fewer than a handful of nodes).

If you do want to do this, you can use the FileSystem API (http://hadoop.apache.org/common/docs/r0.22.0/api/org/apache/hadoop/fs/FileSystem.html) to read the files from within the mapper (you want the open method).
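
If you accept that trade-off, a minimal sketch of such a mapper might look like the following; the class name and output are assumptions, and FaceDetection again stands in for the questioner's own code:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: a text file with one HDFS photo path per line (the approach from question 2).
public class PathListMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private FileSystem fs;

    @Override
    protected void setup(Context context) throws IOException {
        // Get a handle on HDFS once per task from the job configuration.
        fs = FileSystem.get(context.getConfiguration());
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Path photoPath = new Path(line.toString().trim());
        InputStream photoInputStream = fs.open(photoPath);   // FileSystem.open(), as suggested above
        try {
            // FaceDetection is the questioner's own class, used here as a placeholder.
            if (new FaceDetection(photoInputStream).isDog()) {
                context.write(new Text(photoPath.getName()), ONE);
            }
        } finally {
            photoInputStream.close();
        }
    }
}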
