Hadoop: Number of mappers and reducers


Problem Description


    I ran Hadoop MapReduce on a 1.1 GB file multiple times with different numbers of mappers and reducers (e.g. 1 mapper and 1 reducer, 1 mapper and 2 reducers, 1 mapper and 4 reducers, ...).

    Hadoop is installed on a quad-core machine with hyper-threading.

    The following are the top 5 results, sorted by shortest execution time:

    +----------+----------+----------+
    |  time    | # of map | # of red |
    +----------+----------+----------+
    | 7m 50s   |    8     |    2     |
    | 8m 13s   |    8     |    4     |
    | 8m 16s   |    8     |    8     |
    | 8m 28s   |    4     |    8     |
    | 8m 37s   |    4     |    4     |
    +----------+----------+----------+
    

    Edit

    The results for 1-8 reducers and 1-8 mappers (columns = # of mappers, rows = # of reducers):

    +---------+---------+---------+---------+---------+
    |         |    1    |    2    |    4    |    8    |
    +---------+---------+---------+---------+---------+
    |    1    |  16:23  |  13:17  |  11:27  |  10:19  |
    +---------+---------+---------+---------+---------+
    |    2    |  13:56  |  10:24  |  08:41  |  07:52  |
    +---------+---------+---------+---------+---------+
    |    4    |  14:12  |  10:21  |  08:37  |  08:13  |  
    +---------+---------+---------+---------+---------+
    |    8    |  14:09  |  09:46  |  08:28  |  08:16  |
    +---------+---------+---------+---------+---------+
    

    (1) It looks like the program runs slightly faster when I have 8 mappers, but why does it slow down as I increase the number of reducers? (e.g. 8 mappers / 2 reducers is faster than 8 mappers / 8 reducers)

    (2) When I use only 4 mappers, it's a bit slower simply because I'm not utilizing the other 4 cores, right?

    Solution

    The optimal number of mappers and reducers depends on many things.

    The main thing to aim for is a balance between the CPU power used, the amount of data transported (into the mappers, between mappers and reducers, and out of the reducers), and the disk 'head movements'.

    Each task in a MapReduce job works best if it can read/write its data with minimal disk head movement, usually described as "sequential reads/writes". But if the task is CPU bound, the extra disk head movements do not impact the job.

    It seems to me that in this specific case you have

    • a mapper that burns quite a few CPU cycles (i.e. more mappers make the job go faster, because the CPU is the bottleneck and the disks can keep up providing the input data).
    • a reducer that uses almost no CPU cycles and is mostly IO bound. As a result, with a single reducer you are still CPU bound, yet with 4 or more reducers you seem to be IO bound. So 4 or more reducers cause the disk head to move 'too much'.

    Possible ways to handle this kind of situation:

    First do exactly what you did: Do some test runs and see which setting performs best given this specific job and your specific cluster.
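
    The knobs swept in such test runs can be pinned per job. A minimal sketch of the relevant configuration, assuming the standard MRv2 property names (the values shown are purely illustrative, and note that the actual map count is normally derived from the input splits, so the maps setting is only a hint):

    ```xml
    <!-- Illustrative per-job settings (MRv2 property names). -->
    <property>
      <name>mapreduce.job.maps</name>      <!-- only a hint; actual count follows input splits -->
      <value>8</value>
    </property>
    <property>
      <name>mapreduce.job.reduces</name>   <!-- honored exactly -->
      <value>2</value>
    </property>
    ```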

    Then you have three options:

    • Accept the situation you have
    • Shift load from CPU to disk or the other way around.
    • Get a bigger cluster: More CPUs and/or more disks.

    Suggestions for shifting the load:

    • If CPU bound and all CPUs are fully loaded then reduce the CPU load:

      • Check for needless CPU cycles in your code.
      • Switch to a 'lower CPU impact' compression codec: I.e. go from GZip to Snappy or to 'no compression'.
      • Tune the number of mappers/reducers in your job.
    • If IO bound and you have some CPU capacity left:

      • Enable compression: This makes the CPUs work a bit harder and reduces the work the disks have to do.
      • Experiment with various compression codecs (I recommend sticking with either Snappy or Gzip ... I very often go with Gzip).
      • Tune the number of mappers/reducers in your job.
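
    For the compression toggles above, a minimal sketch of the corresponding job properties, again assuming the standard MRv2 names (whether Snappy is actually available depends on how the cluster's native libraries were built, so treat the codec choice as an assumption to verify):

    ```xml
    <!-- Illustrative settings: compress intermediate map output with Snappy,
         and the final job output with Gzip. -->
    <property>
      <name>mapreduce.map.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapreduce.map.output.compress.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
    <property>
      <name>mapreduce.output.fileoutputformat.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapreduce.output.fileoutputformat.compress.codec</name>
      <value>org.apache.hadoop.io.compress.GzipCodec</value>
    </property>
    ```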
