Hadoop: how to access (many) photo images to be processed by map/reduce?

Problem description

I have 10M+ photos saved on the local file system. Now I want to go through each of them and analyze the binary of the photo to see if it's a dog. I basically want to do the analysis in a clustered Hadoop environment. The problem is: how should I design the input for the map method? Let's say that, in the map method, new FaceDetection(photoInputStream).isDog() is all the underlying logic for the analysis.

Specifically, should I upload all of the photos to HDFS? Assuming yes,

  1. How can I use them in the map method?

  2. Is it OK to make the input (to the map) a text file containing all of the photo paths (in HDFS), one per line, and in the map method load the binary like photoInputStream = getImageFromHDFS(photopath);? (Actually, what is the right way to load a file from HDFS during the execution of the map method?)

It seems I'm missing some knowledge about the basic principles of Hadoop, map/reduce and HDFS, but could you please point me in the right direction on the above questions? Thanks!

Recommended answer

How can I use them in the map method?

The major problem is that each file is going to end up in its own map task. So if you have 10M files, you'll have 10M mappers, which doesn't sound terribly reasonable. You may want to consider pre-serializing the files into SequenceFiles (one image per key-value pair). This makes loading the data into the MapReduce job native, so you don't have to write any tricky code. Also, you'll be able to store all of your data in a single SequenceFile, if you so desire. Hadoop handles splitting SequenceFiles quite well.
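As an illustration of how natively that loads, a rough sketch of a mapper over such a SequenceFile could look like the following, assuming Text keys holding file names, BytesWritable values holding the raw image bytes, and the FaceDetection class from the question (the class name DogDetectionMapper is made up here):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: consumes a SequenceFile of (file name, raw image bytes) pairs.
// FaceDetection is the analysis class from the question, assumed to exist.
public class DogDetectionMapper extends Mapper<Text, BytesWritable, Text, Text> {

    @Override
    protected void map(Text fileName, BytesWritable imageBytes, Context context)
            throws IOException, InterruptedException {
        // BytesWritable's backing array may be padded, so respect getLength().
        ByteArrayInputStream photoInputStream = new ByteArrayInputStream(
                imageBytes.getBytes(), 0, imageBytes.getLength());

        if (new FaceDetection(photoInputStream).isDog()) {
            context.write(fileName, new Text("dog"));
        }
    }
}
```

The driver would then simply set SequenceFileInputFormat as the input format, and the framework splits the SequenceFile across mappers on its own.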

Basically, the way this works is: you will have a separate Java process that takes several image files, reads the raw bytes into memory, then stores the data as key-value pairs in a SequenceFile. Keep going and keep writing into HDFS. This may take a while, but you'll only have to do it once.
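A minimal sketch of such a one-off loader might look like this, again assuming Text keys (file names) and BytesWritable values (raw bytes), with the output path and local image paths passed on the command line (the class name ImagePacker is made up here):

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// One-off loader: packs local image files into a single SequenceFile on HDFS,
// one (file name, raw image bytes) pair per image.
public class ImagePacker {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path(args[0]);            // first arg: SequenceFile path on HDFS

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, output, Text.class, BytesWritable.class);
        try {
            for (int i = 1; i < args.length; i++) { // remaining args: local image paths
                File image = new File(args[i]);
                byte[] bytes = Files.readAllBytes(image.toPath());
                writer.append(new Text(image.getName()), new BytesWritable(bytes));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```

In practice you would run several of these loaders in parallel over different batches of images, each writing its own SequenceFile, since a single process walking 10M+ files would take a long time.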

Is it OK to make the input (to the map) a text file containing all of the photo paths (in HDFS), one per line, and in the map method load the binary like photoInputStream = getImageFromHDFS(photopath);? (Actually, what is the right way to load a file from HDFS during the execution of the map method?)

This is not OK if you have any sort of reasonable cluster (which you should, if you are considering Hadoop for this) and you actually want to be using the power of Hadoop. Your MapReduce job will fire off and load the files, but the mappers will be running data-local to the text file, not to the images! So, basically, you are going to be shuffling the image files everywhere, since the JobTracker is not placing tasks where the files are. This will incur a significant amount of network overhead. If you have 1TB of images, you can expect that a lot of them will be streamed over the network if you have more than a few nodes. This may not be so bad depending on your situation and cluster size (fewer than a handful of nodes).

If you do want to do this, you can use the FileSystem API to read the files from HDFS (you want the open method).
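For completeness, a hedged sketch of that approach could look like the following: each input line is an HDFS path to one image, and the mapper opens it via the FileSystem API. FaceDetection is the analysis class from the question, and the class name PathListMapper is made up here; keep in mind the data-locality caveat described above.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: input is a text file of HDFS image paths, one per line.
public class PathListMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Path photoPath = new Path(line.toString().trim());
        FileSystem fs = photoPath.getFileSystem(context.getConfiguration());

        // Open the image from HDFS; note this read is usually NOT data-local.
        FSDataInputStream photoInputStream = fs.open(photoPath);
        try {
            // FaceDetection is the (question's) analysis class, assumed to exist.
            if (new FaceDetection(photoInputStream).isDog()) {
                context.write(new Text(photoPath.getName()), new Text("dog"));
            }
        } finally {
            photoInputStream.close();
        }
    }
}
```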
