Hadoop: how to access (many) photo images to be processed by map/reduce?

Question

I have 10M+ photos saved on the local file system. Now I want to go through each of them and analyze the photo's binary data to see whether it's a dog. I basically want to do the analysis on a clustered Hadoop environment. The problem is: how should I design the input for the map method? Let's say that, in the map method, new FaceDetection(photoInputStream).isDog() is all the underlying logic for the analysis.

Specifically, should I upload all of the photos to HDFS? Assuming yes:

  1. How can I use them in the map method?

  2. Is it OK to make the input (to the map) a text file containing all of the photo paths (in HDFS), one per line, and then in the map method load the binary like photoInputStream = getImageFromHDFS(photopath)? (Actually, what is the right way to load a file from HDFS during the execution of the map method?)

It seems I'm missing some knowledge about the basic principles of Hadoop, map/reduce and HDFS, but could you please point me in the right direction on the questions above? Thanks!

Answer

How can I use them in the map method?

The major problem is that each image is going to be in its own file. So if you have 10M files, you'll have 10M mappers, which doesn't sound terribly reasonable. You may want to consider pre-serializing the files into SequenceFiles (one image per key-value pair). This will make loading the data into the MapReduce job native, so you don't have to write any tricky code. Also, you'll be able to store all of your data in one SequenceFile, if you so desire. Hadoop handles splitting SequenceFiles quite well.
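
For illustration, here is a rough sketch (not from the original answer) of a map-only job that consumes such a SequenceFile, assuming each record stores the file name as the key and the raw image bytes as the value; DogJob and DogMapper are made-up names, and FaceDetection is the hypothetical class from the question:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DogJob {

        // Input records come straight out of the SequenceFile:
        // key = file name, value = raw image bytes.
        public static class DogMapper extends Mapper<Text, BytesWritable, Text, Text> {
            @Override
            protected void map(Text fileName, BytesWritable imageBytes, Context context)
                    throws IOException, InterruptedException {
                ByteArrayInputStream photoInputStream =
                        new ByteArrayInputStream(imageBytes.copyBytes());
                boolean isDog = new FaceDetection(photoInputStream).isDog(); // hypothetical, from the question
                context.write(fileName, new Text(Boolean.toString(isDog)));
            }
        }

        public static void main(String[] args) throws Exception {
            // args[0] = SequenceFile(s) of images on HDFS, args[1] = output directory
            Job job = Job.getInstance(new Configuration(), "dog-detection");
            job.setJarByClass(DogJob.class);
            job.setMapperClass(DogMapper.class);
            job.setNumReduceTasks(0); // map-only: no shuffle, no reducer needed
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }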

Basically, the way this works is: you have a separate Java process that takes several image files, reads the raw bytes into memory, then stores the data as key-value pairs in a SequenceFile. Keep going and keep writing into HDFS. This may take a while, but you'll only have to do it once.
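
A minimal sketch of that pre-serialization step could look like the following (again an illustration, not code from the answer): it walks a local directory, reads each image's raw bytes, and appends them to one SequenceFile on HDFS as <file name, image bytes> pairs.

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ImagesToSequenceFile {
        public static void main(String[] args) throws Exception {
            // args[0] = local directory of photos, args[1] = target SequenceFile path on HDFS
            Configuration conf = new Configuration();
            SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path(args[1])),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            try {
                for (File image : new File(args[0]).listFiles()) {
                    byte[] bytes = Files.readAllBytes(image.toPath()); // the raw image bytes
                    writer.append(new Text(image.getName()), new BytesWritable(bytes));
                }
            } finally {
                writer.close();
            }
        }
    }

In practice you would probably shard the output into several SequenceFiles (say, a few GB each) and run several of these processes in parallel, but the idea is the same.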

Is it OK to make the input (to the map) a text file containing all of the photo paths (in HDFS), one per line, and then in the map method load the binary like photoInputStream = getImageFromHDFS(photopath)? (Actually, what is the right way to load a file from HDFS during the execution of the map method?)

This is not OK if you have any sort of reasonable cluster (which you should if you are considering Hadoop for this) and you actually want to be using the power of Hadoop. Your MapReduce job will fire off and load the files, but the mappers will be running data-local to the text file, not to the images! So, basically, you are going to be shuffling the image files everywhere, since the JobTracker is not placing tasks where the files are. This will incur a significant amount of network overhead. If you have 1TB of images and more than a few nodes, you can expect that a lot of them will be streamed over the network. This may not be so bad depending on your situation and cluster size (less than a handful of nodes).

If you do want to do this, you can use the FileSystem API to read the files (you want the open method).
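
If you go that route anyway, a mapper that reads one HDFS path per input line could look roughly like this; FaceDetection is the hypothetical class from the question, and the rest is a sketch built on the standard FileSystem API:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PathListMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String photoPath = line.toString().trim();      // one HDFS path per line of the input text file
            Configuration conf = context.getConfiguration();
            FileSystem fs = FileSystem.get(conf);

            FSDataInputStream photoInputStream = fs.open(new Path(photoPath)); // pulls the image off HDFS
            try {
                boolean isDog = new FaceDetection(photoInputStream).isDog();   // hypothetical, from the question
                context.write(new Text(photoPath), new Text(Boolean.toString(isDog)));
            } finally {
                photoInputStream.close();
            }
        }
    }

Note that, as explained above, each open call will usually fetch the image from a remote DataNode, which is exactly the network overhead described earlier.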
