Why is Hadoop slow for a simple hello world job

Question

I am following the tutorial on the Hadoop website: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/SingleCluster.html. I run the following example in Pseudo-Distributed Mode.

time hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep input output 'dfs[a-z.]+'

It takes 1 minute 47 seconds to complete. When I turn off the network (Wi-Fi), it finishes in approximately 50 seconds.

When I run the same command in Local (Standalone) Mode, it finishes in approximately 5 seconds (on a Mac).

I understand that Pseudo-Distributed Mode involves more overhead and will therefore take longer, but in this case it takes far longer than I would expect. The CPU is completely idle during the run.

Do you have any idea what could cause this?

Recommended answer

First, I don't have an explanation for why turning off your network would result in faster times. You'd have to dig through the Hadoop logs to figure that out.

This is typical behavior that most people encounter when running Hadoop on a single node. Effectively, you are trying to use FedEx to deliver something to your next-door neighbor. It will always be faster to walk it over, because of the inherent overhead of operating a distributed system. When you run in local mode, you are only performing the Map-Reduce function. When you run pseudo-distributed, the job uses all the Hadoop servers (NameNode and DataNodes for data; Resource Manager and NodeManagers for compute), and what you are seeing is the latency involved in that.

When you submit your job, the Resource Manager has to schedule it. As your cluster is not busy, it will ask the Node Manager for resources. The Node Manager will give it a container, which will run your Application Master (AM). Typically, this loop takes about 10 seconds. Once your AM is running, it will ask the Resource Manager for resources for its Map and Reduce tasks. This takes another 10 seconds. Also, when you submit your job there is a wait of around 3 seconds before the job is actually submitted to the Resource Manager. So far that's 23 seconds, and you haven't done any computation yet.
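The arithmetic above can be laid out as a rough latency budget. The following is only a sketch: the three overhead figures are the ballpark estimates quoted in this answer (not measurements), the observed totals are the timings reported in the question, and the function and constant names are made up for illustration.

```python
# Rough fixed-overhead budget for a pseudo-distributed MapReduce job.
# All values are approximate figures in seconds taken from the answer text,
# not measurements from a real cluster.
SUBMIT_WAIT_S = 3        # wait before the job reaches the Resource Manager
AM_LAUNCH_S = 10         # scheduling and starting the Application Master container
TASK_CONTAINERS_S = 10   # AM requesting containers for its Map and Reduce tasks


def fixed_overhead_s() -> int:
    """Time spent before any computation starts."""
    return SUBMIT_WAIT_S + AM_LAUNCH_S + TASK_CONTAINERS_S


def remaining_s(observed_s: int) -> int:
    """Part of an observed run time not covered by the fixed overhead."""
    return observed_s - fixed_overhead_s()


if __name__ == "__main__":
    # Observed totals reported in the question: 1:47 (107 s) and about 50 s.
    for label, observed in [("network on", 107), ("network off", 50)]:
        print(
            f"{label}: {observed}s total, "
            f"{fixed_overhead_s()}s fixed overhead, "
            f"{remaining_s(observed)}s remaining"
        )
```

Under these assumptions, the fixed scheduling overhead accounts for 23 seconds of either run. The rest has to come from task execution and whatever else is stalling, which is where the logs would help.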

Once the job is running, the most likely cause of waiting is memory allocation. On smaller systems (< 32 GB of memory), the OS might take a while to allocate space. If you were to run the same job on what is considered commodity hardware for Hadoop (16+ cores, 64+ GB), you would probably see run times closer to 25-30 seconds.
