Is it possible to restrict a MapReduce job from accessing remote data?


Question

We have a particular algorithm that we want to integrate with HDFS. The algorithm requires us to access data locally (the work would be done exclusively in the Mapper). However, we do want to take advantage of HDFS for distributing the file (providing reliability and striping). After the calculation is performed, we'd use the Reducer simply to send back the answer, rather than to perform any additional work. Avoiding network use is an explicit goal. Is there a configuration setting that would let us restrict network data access, so that when a MapReduce job is started it only accesses its local DataNode?

Update: adding a bit of context

We're attempting to analyze this problem with string matching. Assume our cluster has N nodes and a file containing N GB of text. The file is stored in HDFS and distributed in even parts to the nodes (1 part per node). Can we create a MapReduce job that launches one process on each node to access the part of the file that's sitting on the same host? Or would the MapReduce framework distribute the work unevenly (e.g. 1 job accessing all N parts of the data, or 0.5N nodes attempting to process the whole file)?

Recommended answer

If you set the number of reduce tasks to zero, you can skip the shuffle and therefore its network cost.

When creating your job, this can be done with the following line of code:

job.setNumReduceTasks(0);

I don't know what your algorithm will do, but say it is a pattern-matching algorithm looking for occurrences of a particular word. The mappers would then report the number of matches per split. If you want to add up the counts, you need network communication and a reducer.
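To make the per-split counting concrete, here is a minimal sketch in plain Java that only simulates the idea; it does not use the Hadoop API. The class name `SplitMatchSketch`, its methods, and the sample splits are hypothetical and not taken from the question. Each "map" call looks at a single split, and the final sum stands in for the work a reducer would do once the per-split counts have been sent over the network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, Hadoop-free simulation of per-split word counting.
public class SplitMatchSketch {

    // "Map" step: count non-overlapping occurrences of word in one split.
    static int countMatches(String split, String word) {
        if (word.isEmpty()) {
            return 0; // avoid an endless loop on an empty search term
        }
        int count = 0;
        int from = 0;
        while ((from = split.indexOf(word, from)) != -1) {
            count++;
            from += word.length();
        }
        return count;
    }

    // Runs the map step once per split; each call only sees its own split.
    static List<Integer> mapPhase(List<String> splits, String word) {
        List<Integer> counts = new ArrayList<>();
        for (String split : splits) {
            counts.add(countMatches(split, word));
        }
        return counts;
    }

    // "Reduce" step: combining the counts is the part that would need
    // the per-split results to be gathered in one place.
    static int reducePhase(List<Integer> perSplitCounts) {
        int total = 0;
        for (int c : perSplitCounts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("foo bar foo", "bar baz", "foo");
        List<Integer> counts = mapPhase(splits, "foo");
        System.out.println("per-split counts: " + counts); // [2, 0, 1]
        System.out.println("total: " + reducePhase(counts)); // 3
    }
}
```

With zero reduce tasks, a job would stop after the equivalent of `mapPhase` and leave one count per split as its output; the total is only available if something performs the `reducePhase` step afterwards.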

The first Google match I found for a map-only example: Map-Only MR jobs
