Tips to improve MapReduce job performance in Hadoop
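Problem description
I have 100 mappers and 1 reducer running in a job. How can I improve the job's performance? As I understand it, using a combiner can improve performance to a great extent. But what else do we need to configure to improve the job's performance?
Solution
The question does not give enough data (input file size, HDFS block size, average map processing time, number of mapper slots and reduce slots in the cluster, etc.) for us to suggest specific tips.
But there are some general guidelines to improve performance.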
- If each task takes less than 30-40 seconds, reduce the number of tasks.
- If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks is smaller.
- So long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster.
- The number of reduce tasks per job should be equal to or a bit less than the number of reduce slots in the cluster.
- Configure the cluster properly, using the right diagnostic tools.
- Use compression when writing intermediate data to disk.
- Tune the number of map and reduce tasks as per the tips above.
- Use a Combiner wherever it is appropriate.
- Use the most appropriate data types for rendering output (do not use LongWritable when the range of output values fits in the Integer range; IntWritable is the right choice in this case).
- Reuse Writables.
- Have the right profiling tools.
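
The task-sizing guidelines in the list come down to simple arithmetic, which can be sketched as below. This is only an illustration under stated assumptions: the class and method names (TaskSizing, estimateMapTasks, mapWaves, suggestReduceTasks) are hypothetical and not part of Hadoop, the estimate assumes one map task per block-sized input split, and the 128 MB baseline, the 100 mapper slots, and the 20 reduce slots are made-up example values.

```java
// Hypothetical helper, not part of Hadoop: rough task-count arithmetic only.
final class TaskSizing {
    static final long MB = 1024L * 1024L;
    static final long TB = 1024L * 1024L * MB;

    // Assumption: one map task per input split, and split size == block size.
    static long estimateMapTasks(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // Number of "waves" needed to run all map tasks on the available mapper slots.
    static long mapWaves(long mapTasks, int mapSlots) {
        return (mapTasks + mapSlots - 1) / mapSlots; // ceiling division
    }

    // "Equal to or a bit less than" the reduce slots; 'spare' is an assumed margin.
    static int suggestReduceTasks(int reduceSlots, int spare) {
        return Math.max(1, reduceSlots - spare);
    }

    public static void main(String[] args) {
        long input = 1 * TB;   // example input size from the guideline
        int mapSlots = 100;    // assumed cluster capacity
        int reduceSlots = 20;  // assumed cluster capacity

        for (long blockMb : new long[] {128, 256, 512}) {
            long tasks = estimateMapTasks(input, blockMb * MB);
            System.out.println(blockMb + " MB blocks: " + tasks + " map tasks, "
                    + mapWaves(tasks, mapSlots) + " waves on " + mapSlots + " slots");
        }
        System.out.println("Suggested reduce tasks: " + suggestReduceTasks(reduceSlots, 2));
    }
}
```

For 1 TB of input, the estimate drops from 8192 map tasks at 128 MB blocks to 4096 at 256 MB and 2048 at 512 MB, which is the effect the block-size guideline is after. Whether each resulting task actually runs for at least 30-40 seconds still has to be measured on the real cluster.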
Some more tips:
Have a look at this Cloudera article for some more tips: http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/