Is it possible to restrict a MapReduce job from accessing remote data?

Problem description

We have a particular algorithm that we want to integrate with HDFS. The algorithm requires us to access data locally (the work would be done exclusively in the Mapper). However, we do want to take advantage of HDFS in terms of distributing the file (providing reliability and striping). After the calculation is performed, we'd use the Reducer to simply send back the answer, rather than perform any additional work. Avoiding network use is an explicit goal. Is there a configuration setting that would allow us to restrict network data access, so that when a MapReduce job is started it will only access its local DataNode?

UPDATE: Adding a bit of context

We're attempting to analyze this problem with string matching. Assume our cluster has N nodes and a file containing N GB of text. The file is stored in HDFS and distributed in even parts across the nodes (1 part per node). Can we create a MapReduce job that launches one process on each node to access the part of the file that's sitting on the same host? Or would the MapReduce framework distribute the work unevenly (e.g. 1 job accessing all N parts of the data, or 0.5N nodes attempting to process the whole file)?

Solution

If you set the number of reduce tasks to zero, you can skip the shuffle phase and therefore the network cost of your algorithm.

When creating your job, this can be done with the following line of code:

job.setNumReduceTasks(0);
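
For context, below is a minimal sketch of where that line fits in a complete map-only driver, assuming the org.apache.hadoop.mapreduce API; the class names MapOnlyDriver and MatchCountMapper (a mapper sketched after the next paragraph) and the command-line argument handling are placeholders, not anything from the original answer.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "map-only string matching");
            job.setJarByClass(MapOnlyDriver.class);

            // TextInputFormat typically produces one split per HDFS block, and the
            // scheduler tries to run each map task on a node holding a replica of its split.
            job.setInputFormatClass(TextInputFormat.class);
            job.setMapperClass(MatchCountMapper.class); // hypothetical mapper, sketched below

            // Zero reducers: no shuffle, each map task writes its own output directly.
            job.setNumReduceTasks(0);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Run against an input whose blocks are spread across the cluster, the scheduler will place each map task on a node that holds that split locally whenever it can, which is what keeps the read traffic off the network.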

I don't know what your algorithm will do, but say it is a pattern-matching algorithm looking for occurrences of a particular word; the mappers would then report the number of matches per split. If you want to add up the counts, you need network communication and a reducer.
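
As an illustration of that idea, here is a hedged sketch of such a mapper, assuming TextInputFormat input; the class name MatchCountMapper and the hard-coded search word are hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MatchCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final String TARGET = "needle"; // placeholder word to search for
        private int matches = 0;

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            // Count occurrences of the target word in each line of this mapper's split.
            for (String token : line.toString().split("\\s+")) {
                if (token.equals(TARGET)) {
                    matches++;
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit a single record per split: the number of matches this mapper saw.
            context.write(new Text(TARGET), new IntWritable(matches));
        }
    }

With zero reduce tasks, each map task writes its own output file (part-m-00000, part-m-00001, ...), so the per-split counts still have to be summed somewhere afterwards, exactly as noted above.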

The first Google match for a map-only example that I found: Map-Only MR jobs
