When is it a good idea to increase/decrease the number of nodes interactively on a Hadoop MapReduce job?

Problem description

I have an intuition that increasing/decreasing the number of nodes interactively on a running job can speed up map-heavy jobs, but won't help with reduce-heavy jobs, where most of the work is done by the reduce phase.

There's an FAQ about this, but it doesn't really explain it very well:

http://aws.amazon.com/elasticmapreduce/faqs/#cluster-18

Solution

This question was answered by Christopher Smith, who gave me permission to post here.


As always... "it depends". One thing you can pretty much always count on: adding nodes later on is not going to help you as much as having the nodes from the get go.
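
As a concrete note on what "adding nodes later" means on EMR: it is just a change to an instance group's target count on the running cluster. Below is a minimal sketch with the AWS SDK for Java; the class and method names are assumed from the v1 SDK and the instance-group id is a placeholder, so verify both against the SDK documentation.

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupModifyConfig;
import com.amazonaws.services.elasticmapreduce.model.ModifyInstanceGroupsRequest;

// Sketch: grow a running EMR cluster by raising an instance group's target count.
// "ig-XXXXXXXX" is a placeholder; look the real id up in the EMR console or via ListInstanceGroups.
public class ResizeEmrClusterSketch {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
        emr.modifyInstanceGroups(new ModifyInstanceGroupsRequest()
                .withInstanceGroups(new InstanceGroupModifyConfig()
                        .withInstanceGroupId("ig-XXXXXXXX") // placeholder instance group id
                        .withInstanceCount(10)));           // new target node count
    }
}
```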

When you create a Hadoop job, it gets split up into tasks. These tasks are effectively "atoms of work". Hadoop lets you tweak the # of mapper and # of reducer tasks during job creation, but once the job is created, it is static. Tasks are assigned to "slots". Traditionally, each node is configured to have a certain number of slots for map tasks, and a certain number of slots for reduce tasks, but you can tweak that. Some newer versions of Hadoop don't require you to designate the slots as being for map or reduce tasks. Anyway, the JobTracker periodically assigns tasks to slots. Because this is done dynamically, new nodes coming online can speed up the processing of a job by providing more slots to execute the tasks.
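
To make the "static once the job is created" point concrete, here is a minimal sketch using the newer org.apache.hadoop.mapreduce API; the job name, paths, split size, and the commented-out mapper/reducer classes are placeholders. The reducer count is fixed explicitly at job creation, while the mapper count falls out of how the input is split.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobSizingSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "job-sizing-sketch");
        job.setJarByClass(JobSizingSketch.class);

        // job.setMapperClass(MyMapper.class);   // placeholder: your map logic
        // job.setReducerClass(MyReducer.class); // placeholder: your reduce logic

        // The number of reduce tasks is fixed here; it cannot change once the job is submitted.
        job.setNumReduceTasks(20);

        // The number of map tasks is derived from input splits; capping the split
        // size is one way to get more (smaller) map tasks out of the same input.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // ~128 MB per split

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```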

This sets the stage for understanding the reality of adding new nodes. There's obviously an Amdahl's law issue where having more slots than pending tasks accomplishes little (if you have speculative execution enabled, it does help somewhat, as Hadoop will schedule the same task to run on many different nodes, so that a slow node's tasks can be completed by faster nodes if there are spare resources). So, if you didn't define your job with many map or reduce tasks, adding more nodes isn't going to help much. Of course, each task imposes some overhead, so you don't want to go crazy high either. That's why I suggest a guideline for task size should be "something which takes ~2-5 minutes to execute".
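
Speculative execution itself is just a pair of configuration flags. The property names below are the Hadoop 2.x-era ones (older releases used mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution), so treat them as an assumption to check against your distribution:

```java
import org.apache.hadoop.conf.Configuration;

public class SpeculativeExecutionSketch {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Allow slow map/reduce tasks to be duplicated on other nodes that have spare slots.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);
        return conf;
    }
}
```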

Of course, when you add nodes dynamically, they have one other disadvantage: they don't have any data local. Obviously, if you are at the start of an EMR pipeline, none of the nodes have data in them, so it doesn't matter, but if you have an EMR pipeline made of many jobs, with earlier jobs persisting their results to HDFS, you get a huge performance boost because the JobTracker will favour shaping and assigning tasks so nodes have that lovely locality of data (this is a core trick of the whole MapReduce design to maximize performance). On the reducer side, data is coming from other map tasks, so dynamically added nodes are really at no disadvantage as compared to other nodes.

So, in principle, dynamically adding new nodes is actually less likely to help with IO bound map tasks that are reading from HDFS.

Except...

Hadoop has a variety of cheats under the covers to optimize performance. One is that it starts transmitting map output data to the reducers before the map task completes/the reducer starts. This obviously is a critical optimization for jobs where the mappers generate a lot of data. You can tweak when Hadoop starts to kick off the transfers. Anyway, this means that a newly spun up node might be at a disadvantage, because the existing nodes might already have such a huge data advantage. Obviously, the more output that the mappers have transmitted, the larger the disadvantage.
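
The knob that controls when those transfers kick off is the reducer "slow start" fraction: how much of the map phase must complete before reduce tasks are scheduled and begin fetching map output. A sketch, assuming the Hadoop 2.x property name (older releases called it mapred.reduce.slowstart.completed.maps):

```java
import org.apache.hadoop.conf.Configuration;

public class ReducerSlowStartSketch {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Don't launch reducers (and start pulling map output) until ~80% of maps are done.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
        return conf;
    }
}
```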

That's how it all really works. In practice though, a lot of Hadoop jobs have mappers processing tons of data in a CPU intensive fashion, but outputting comparatively little data to the reducers (or they might send a lot of data to the reducers, but the reducers are still very simple, so not CPU bound at all). Often jobs will have few (sometimes even 0) reducer tasks, so even where extra nodes could help, if you already have a reduce slot available for every outstanding reduce task, new nodes can't help. New nodes also disproportionately help out with CPU bound work, for obvious reasons, and because that tends to be map tasks more than reduce tasks, that's where people typically see the win. If your mappers are I/O bound and pulling data from the network, adding new nodes obviously increases the aggregate bandwidth of the cluster, so it helps there, but if your map tasks are I/O bound reading HDFS, the best thing is to have more initial nodes, with data already spread over HDFS. It's not unusual to see reducers get I/O bound because of poorly structured jobs, in which case adding more nodes can help a lot, because it splits up the bandwidth again.
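
For the "sometimes even 0 reducer tasks" case above: a map-only job is simply one with the reducer count set to zero, so map output is written straight to the output path and there is no shuffle for a late-arriving node to be behind on. A tiny sketch (the mapper class is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJobSketch {
    public static Job create(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "map-only-sketch");
        // job.setMapperClass(MyMapper.class); // placeholder: your CPU-heavy mapper
        job.setNumReduceTasks(0);              // map-only: no reducers, no shuffle phase
        return job;
    }
}
```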

There's a caveat there too of course: with a really small cluster, reducers get to read a lot of their data from the mappers running on the local node, and adding more nodes shifts more of the data to being pulled over the much slower network. You can also have cases where reducers spend most of their time just multiplexing data processing from all the mappers sending them data (although that is tunable as well).

If you are asking questions like this, I'd highly recommend profiling your job using something like Amazon's offering of KarmaSphere. It will give you a better picture of where your bottlenecks are and what are your best strategies for improving performance.
