How to run MapReduce tasks in Parallel with hadoop 2.x?

Question

I would like my map and reduce tasks to run in parallel. However, despite trying every trick in the bag, they still run sequentially. I read in How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce that the number of tasks running in parallel can be set using the following formula:

min (yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb, 
 yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores)

However, I did that, as you can see from the yarn-site.xml and mapred-site.xml I am using below. But the tasks still run sequentially. Note that I am using the open source Apache Hadoop and not Cloudera. Would shifting to Cloudera solve the problem? Also note that my input files are big enough that dfs.block.size should also not be an issue.

yarn-site.xml

    <configuration>
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>131072</value>
    </property>
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>64</value>
    </property>
    </configuration>

mapred-site.xml

    <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>

    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>16384</value>
    </property>

    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>16384</value>
    </property>

    <property>
        <name>mapreduce.map.cpu.vcores</name>
        <value>8</value>
    </property>

    <property>
        <name>mapreduce.reduce.cpu.vcores</name>
        <value>8</value>
    </property>
    </configuration>
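
For reference, plugging the values from these two files into the formula above gives:

    min(131072 / 16384, 64 / 8) = min(8, 8) = 8

so on paper YARN should be able to schedule up to 8 concurrent tasks per node.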

Answer

A container is the logical execution template reserved for running Map/Reduce tasks on every node of the cluster.

The yarn.nodemanager.resource.memory-mb property tells YARN to reserve that much RAM for all the containers dispatched on the node to execute Map/Reduce tasks. This is the upper bound on the memory that can be handed out to containers on that node.

But in your case, the free memory on the node is almost 11 GB, while you have configured yarn.nodemanager.resource.memory-mb to almost 128 GB (131072) and mapreduce.map.memory.mb & mapreduce.reduce.memory.mb to 16 GB. The required upper-bound size for a Map/Reduce container is 16 GB, which is higher than the 11 GB of free memory. This could be the reason that only one container was allocated on the node for execution.
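
Put differently, using the figures above:

    configured per-container request : 16384 MB
    memory actually free on the node : ~11264 MB (about 11 GB)

A single 16 GB container request already exceeds what the node can physically provide.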

We should reduce the values of the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb properties to below the amount of free memory, so that more than one container runs in parallel.
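
As a rough sketch (the 2 GB per-task figure is illustrative, and it assumes roughly 11 GB is genuinely available to the NodeManager):

    <!-- yarn-site.xml: advertise only the memory the node really has -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>11264</value>
    </property>

    <!-- mapred-site.xml: request 2 GB per task so several containers fit -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>2048</value>
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>2048</value>
    </property>

With these values the formula from the question gives min(11264 / 2048, 64 / 8), i.e. up to 5 containers running in parallel per node. In practice the task JVM heap (mapreduce.[map|reduce].java.opts) is usually set somewhat below the container size as well, e.g. -Xmx1638m for a 2048 MB container.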

Also look at ways to increase the free memory, since more than 90% of it is already in use.

Hope this helps :)
