Tips to improve MapReduce job performance in Hadoop


Question


    I have 100 mappers and 1 reducer running in a job. How can I improve the job's performance?

    As I understand it, using a combiner can improve performance to a great extent. But what else do we need to configure to improve the job's performance?

    Solution

    With the limited data in this question (input file size, HDFS block size, average map processing time, and the number of mapper and reduce slots in the cluster are all missing), we can't suggest specific tips.

    However, there are some general guidelines for improving performance.

    1. If each task takes less than 30-40 seconds, reduce the number of tasks.
    2. If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks is smaller.
    3. As long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster.
    4. The number of reduce tasks per job should be equal to, or slightly less than, the number of reduce slots in the cluster.
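
The arithmetic behind guidelines 2 to 4 can be sketched in a few lines of Python. This is only a rough model, not Hadoop code: it assumes splittable input with one map task per block, and the function names, the 100-slot cluster figure, and the 95% factor standing in for "a bit less than the number of reduce slots" are hypothetical choices made for illustration.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def ceil_div(a, b):
    """Integer ceiling division (avoids floating-point rounding)."""
    return -(-a // b)


def estimate_map_tasks(input_bytes, block_bytes):
    """Assumption: one map task per block of splittable input."""
    return ceil_div(input_bytes, block_bytes)


def map_waves(num_tasks, map_slots):
    """How many rounds the map slots need to run all map tasks."""
    return ceil_div(num_tasks, map_slots)


def suggested_reduce_tasks(reduce_slots, percent=95):
    """Hypothetical heuristic: slightly fewer reducers than reduce slots."""
    return max(1, reduce_slots * percent // 100)


if __name__ == "__main__":
    map_slots = 100  # hypothetical cluster size
    for block_mb in (128, 256, 512):
        tasks = estimate_map_tasks(1 * TB, block_mb * MB)
        waves = map_waves(tasks, map_slots)
        print(f"{block_mb} MB blocks: {tasks} map tasks, {waves} waves")
    print("reducers for 40 reduce slots:", suggested_reduce_tasks(40))
```

Doubling the block size halves the estimated number of map tasks, which is why larger blocks help when individual tasks would otherwise finish in only a few seconds.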

    Some more tips:

    1. Configure the cluster properly, using the right diagnostic tools.
    2. Use compression when writing intermediate data to disk.
    3. Tune the number of map and reduce tasks as per the tips above.
    4. Use a Combiner wherever it is appropriate.
    5. Use the most appropriate data types for output (do not use LongWritable when the output values fit in the integer range; IntWritable is the right choice in that case).
    6. Reuse Writables.
    7. Use the right profiling tools.
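
To see why a combiner (tip 4) matters most when a single reducer receives everything, here is a small pure-Python simulation of a word-count job. It does not use the Hadoop API; `run_job`, `map_words`, and `sum_by_key` are hypothetical names, and the two input splits are made-up data. The sketch counts how many key/value records would cross the shuffle with and without a per-mapper combine step.

```python
from collections import Counter


def map_words(line):
    """Map phase: emit (word, 1) for every word in the split."""
    return [(word, 1) for word in line.split()]


def sum_by_key(pairs):
    """Sum values per key; serves as both combiner and reducer here."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(splits, use_combiner):
    """Return (final counts, number of records sent to the reducer)."""
    shuffled = []
    for split in splits:
        pairs = map_words(split)
        if use_combiner:
            # Pre-aggregate on the mapper side before the shuffle.
            pairs = list(sum_by_key(pairs).items())
        shuffled.extend(pairs)
    return sum_by_key(shuffled), len(shuffled)


if __name__ == "__main__":
    splits = ["a b a a", "b b a c"]  # one string per simulated mapper
    for flag in (False, True):
        counts, records = run_job(splits, use_combiner=flag)
        print(f"combiner={flag}: {records} shuffled records, counts={counts}")
```

The final counts are identical in both runs; only the volume of intermediate data changes. This works because summation is associative and commutative, which is the condition under which a reducer can safely double as a combiner.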

    Have a look at this Cloudera article for some more tips: http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
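
As a rough illustration of tip 5 (preferring IntWritable over LongWritable when values fit), the sketch below compares the raw encoded sizes. It relies on IntWritable being written as a 4-byte integer and LongWritable as an 8-byte integer; Python's `struct` module is used here only to mimic those fixed-width big-endian encodings. The helper names and the one-million-record figure are hypothetical, and real savings also depend on compression and on the other fields in each record.

```python
import struct

INT_MIN, INT_MAX = -2 ** 31, 2 ** 31 - 1


def fits_in_int(values):
    """True if every value is within the 32-bit signed integer range."""
    return all(INT_MIN <= v <= INT_MAX for v in values)


def encoded_size(value, as_long):
    """Bytes used by a fixed-width big-endian 64-bit or 32-bit encoding."""
    fmt = ">q" if as_long else ">i"
    return len(struct.pack(fmt, value))


def bytes_saved(num_records):
    """Raw bytes saved by writing 32-bit instead of 64-bit values."""
    per_record = encoded_size(0, as_long=True) - encoded_size(0, as_long=False)
    return num_records * per_record


if __name__ == "__main__":
    counts = [3, 1200, 2_000_000]
    if fits_in_int(counts):
        print("32-bit output type is sufficient")
        print("bytes saved over 1,000,000 records:", bytes_saved(1_000_000))
```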
