Hadoop setting for maximum simultaneous map/reduce tasks does not work in pseudo-distributed mode


Problem description


I configured Hadoop 2.4.1 on a single machine (4 cores) to run in pseudo-distributed mode, and I am able to run my map/reduce program through the hadoop shell command on input files in HDFS.

But I notice that the map and reduce tasks still appear to run in a single thread. So I tried hard-coding the properties mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum, both to 4 (just as an experiment; I know it is not an ideal setting). But I still see the map and reduce tasks running serially.

I configured this by modifying etc/hadoop/mapred-site.xml to include the following:

<configuration>
    <property>
        <name>mapreduce.tasktracker.map.tasks.maximum</name>
        <value>4</value>
    </property>

    <property>
        <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
        <value>4</value>
    </property>
</configuration>

Then I restarted the TaskTracker with these commands:

sbin/hadoop-daemon.sh stop tasktracker
sbin/hadoop-daemon.sh start tasktracker

This follows the article here: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W265aa64a4f21_43ee_b236_c42a1c875961/page/Tuning%20number%20of%20map%20and%20reduce%20slots%20on%20a%20TaskTracker%20node

I concluded that it still runs single-threaded by printing something from the constructor whenever a mapper or reducer object is constructed. The output shows the mappers being constructed one by one, spread evenly over the time the mappers are running, and the reducers likewise constructed one by one, spread evenly over time.
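Construction messages only show when each task starts. If both a start and an end time can be collected for each task, whether tasks overlap can be checked directly. The sketch below is a minimal illustration of that check; the function name `max_overlap` and the sample intervals are hypothetical and not taken from any Hadoop tool or log format.

```python
def max_overlap(intervals):
    """Return the largest number of tasks that were running at the same time.

    intervals: iterable of (start, end) pairs in any consistent time unit.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))    # a task starts
        events.append((end, -1))     # a task ends
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back tasks are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Hypothetical timings (seconds since job start), for illustration only.
serial_tasks = [(0, 10), (10, 20), (20, 30)]
overlapping_tasks = [(0, 10), (2, 12), (11, 20)]

print(max_overlap(serial_tasks))       # 1: tasks ran one after another
print(max_overlap(overlapping_tasks))  # 2: at most two tasks ran together
```

A peak of 1 matches the serial behaviour described above; anything higher means at least some tasks ran concurrently.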

What am I missing here?

Solution

I figured out that starting and stopping the TaskTracker is no longer supported in the version of Hadoop I am using. There is too much confusing information around for different versions, and it all gets mixed up.

After I configured and started YARN (following https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/SingleCluster.html), the map and reduce tasks really do appear to run with some concurrency. On a larger data set (a job that runs for about 2 minutes), running with a maximum of 2 map and 2 reduce tasks gives an improvement of about 10 seconds, which makes some sense.

It also looks to me like the two parameters mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum no longer take effect, though I have not seen any documentation confirming that.

Instead, YARN takes over all resource management: the concept of a slot is gone, replaced by containers, vcores, and so on. The combined settings described at the link below determine how much concurrency a node can run.

http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_yarn_tuning.html

This is only my own understanding so far, and it needs more confirmation.
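To make the idea of "combined settings" concrete, here is a rough back-of-the-envelope sketch of how container sizes could bound per-node concurrency. It assumes the relevant settings are yarn.nodemanager.resource.memory-mb, yarn.nodemanager.resource.cpu-vcores, mapreduce.map.memory.mb and mapreduce.map.cpu.vcores; that choice, the function name, and all the numbers are my assumptions for illustration. It also ignores the container used by the MapReduce ApplicationMaster and any rounding the scheduler applies, and whether vcores are counted at all depends on how the scheduler is configured.

```python
def max_concurrent_containers(node_memory_mb, node_vcores,
                              container_memory_mb, container_vcores=1,
                              count_vcores=False):
    """Estimate how many equally sized task containers fit on one node.

    count_vcores=False models a scheduler that only looks at memory;
    count_vcores=True models one that also limits by CPU vcores.
    """
    if container_memory_mb <= 0 or container_vcores <= 0:
        raise ValueError("container sizes must be positive")

    by_memory = node_memory_mb // container_memory_mb
    if not count_vcores:
        return by_memory

    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)


# Hypothetical values for a 4-core machine (not taken from the question).
node_memory_mb = 8192   # assumed yarn.nodemanager.resource.memory-mb
node_vcores = 4         # assumed yarn.nodemanager.resource.cpu-vcores
map_memory_mb = 1024    # assumed mapreduce.map.memory.mb

print(max_concurrent_containers(node_memory_mb, node_vcores, map_memory_mb))
# 8 when only memory is considered
print(max_concurrent_containers(node_memory_mb, node_vcores, map_memory_mb,
                                count_vcores=True))
# 4 when vcores are also considered
```

Under these assumptions, shrinking the per-task memory request or raising the node's advertised resources is what increases concurrency, rather than any per-TaskTracker slot count.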
