How to tune mapred.reduce.parallel.copies?


Problem description



After reading http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html, we want to experiment with mapred.reduce.parallel.copies.

The blog mentions "looking very carefully at the logs". How would we know we've reached the sweet spot? What should we look for? How can we detect that we're over-parallelizing?

Solution

In order to do that, you should basically look at four things: CPU, RAM, disk, and network. If your setup is crossing the threshold of these metrics, you can deduce that you are pushing the limits. For example, if you have set mapred.reduce.parallel.copies to a value much higher than the number of cores available, you'll end up with too many threads in the waiting state, since threads are created based on this property to fetch the map output. On top of that, the network might get overwhelmed.

Or, if there is too much intermediate output to be shuffled, your job will become slow, as you will need a disk-based shuffle in that case, which is slower than a RAM-based shuffle. Choose a sensible value for mapred.job.shuffle.input.buffer.percent based on your RAM (it defaults to 70% of the reducer heap, which is normally fine). These are the kinds of signals that tell you whether you are over-parallelizing or not. There are a lot of other things you should consider as well; I would recommend going through Chapter 6 of "Hadoop: The Definitive Guide".
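To make this concrete, here is a rough sketch of how the two properties above could be set in mapred-site.xml, using the old-style (Hadoop 1.x era) property names the question refers to. The values shown are illustrative starting points only, not tuned recommendations:

```xml
<!-- Illustrative mapred-site.xml fragment; values are starting points only. -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <!-- default is 5; raising it adds fetch threads per reducer,
       so watch CPU, network, and thread wait states as you increase it -->
  <value>10</value>
</property>
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <!-- fraction of the reducer heap used to buffer map outputs
       during the shuffle; default is 0.70 -->
  <value>0.70</value>
</property>
```

After changing these, re-run a representative job and compare the shuffle-phase timings in the job logs against the CPU, RAM, disk, and network metrics mentioned above.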

Some of the measures you could take to make your jobs more efficient include using a combiner to limit the data transfer, enabling intermediate compression, etc.
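As a sketch of the compression suggestion, intermediate (map-output) compression can be enabled with the old-style property names like this; the combiner itself has to be set in the job driver code (e.g. via setCombinerClass), and reusing the reducer as a combiner is only valid when the reduce function is commutative and associative:

```xml
<!-- Illustrative fragment: compress intermediate map output before the shuffle. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <!-- codec is optional; DefaultCodec (zlib) is used if unset -->
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
```

This trades some CPU for a smaller shuffle, which helps most when the network or disk is the bottleneck.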

HTH

P.S.: This answer is not specific to just mapred.reduce.parallel.copies; it is about tuning your job in general. Frankly speaking, setting only this property is not going to help you much. You should consider the other important properties as well.
