How to make Hadoop use all the cores on my system?


Problem Description



I have a 32 core system. When I run a MapReduce job using Hadoop I never see the java process use more than 150% CPU (according to top) and it usually stays around the 100% mark. It should be closer to 3200%.

Which property do I need to change (and in which file) to enable more workers?

Solution

There could be two issues, which I outline below. I'd also like to point out that this is a very common question and you should look at the previously asked Hadoop questions.


Your mapred.tasktracker.map.tasks.maximum could be set too low in conf/mapred-site.xml. This is the issue if, when you check the JobTracker, you see several pending tasks but only a few running tasks. Each task is a single thread, so you would hypothetically need a maximum of 32 slots on that node.
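For reference, a minimal sketch of what that entry could look like in conf/mapred-site.xml, assuming the classic MRv1 TaskTracker described in this answer; the values (32 map slots, plus an optional companion reduce-slot setting of 8) are illustrative, not prescriptive, and the TaskTracker typically needs a restart to pick up the change.

<configuration>
  <property>
    <!-- Maximum number of map tasks this TaskTracker will run concurrently; 32 matches the 32 cores -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>32</value>
  </property>
  <property>
    <!-- Optional: concurrent reduce slots can be raised as well; 8 here is purely illustrative -->
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>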


Otherwise, likely your data is not being split into enough chunks. Are you running over a small amount of data? It could be that your MapReduce job is running over only a few input splits and thus does not require more mappers. Try running your job over hundreds of MB of data instead and see if you still have the same issue. Hadoop automatically splits your files. The number of blocks a file is split up into is the total size of the file divided by the block size. By default, one map task will be assigned to each block (not each file).

In your conf/hdfs-site.xml configuration file, there is a dfs.block.size parameter. Most people set this to 64 or 128 MB. However, if you are trying to do something tiny, you could lower it to split up the work more.
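To make the arithmetic concrete: a 1 GB input at the default 64 MB block size yields only 16 blocks, and therefore at most 16 map tasks, no matter how many slots the node offers. A hedged sketch of a smaller block size in conf/hdfs-site.xml could look like the following; the 16 MB value is purely illustrative, dfs.block.size is specified in bytes, and it only affects files written after the change, so an existing input file would need to be re-uploaded.

<configuration>
  <property>
    <!-- HDFS block size in bytes; 16 MB (instead of the 64 MB default) produces more splits for small inputs -->
    <name>dfs.block.size</name>
    <value>16777216</value>
  </property>
</configuration>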

You can also manually split your file into 32 chunks.

