Amazon EMR - What is the need for task nodes when we have core nodes?


Question

I have been learning about Amazon EMR lately, and as far as I understand, an EMR cluster lets us choose from three node types:

  1. Master, which runs the primary Hadoop daemons such as the NameNode, JobTracker, and ResourceManager.
  2. Core, which runs the DataNode and TaskTracker daemons.
  3. Task, which runs only the TaskTracker.

My question is: why does EMR provide task nodes, when Hadoop recommends running the DataNode daemon and the TaskTracker daemon on the same node? What is Amazon's logic behind this? You can keep data in S3, stream it to HDFS on the core nodes, and process it there, rather than sharing data from HDFS with task nodes, which increases I/O overhead. As far as I know, Hadoop runs TaskTrackers on the DataNodes that hold the data blocks for a particular task, so why have TaskTrackers on separate nodes?

Recommended answer

According to the AWS documentation [1]:

The node types in Amazon EMR are as follows:

Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node, and it's possible to create a single-node cluster with only the master node.

Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.

Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
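As a rough illustration of how the three roles show up in practice, the sketch below builds the instance-group list that boto3's `run_job_flow` accepts under `Instances.InstanceGroups`. The helper name `build_instance_groups`, the group names, and the instance type are assumptions chosen for this example; only the `MASTER`, `CORE`, and `TASK` role values come from the EMR API.

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Return an InstanceGroups list with one entry per EMR node role.

    Hypothetical helper: the function name, group names, and default
    instance type are illustrative, not taken from the AWS documentation.
    """
    if core_count < 1:
        raise ValueError("a multi-node cluster needs at least one core node")
    if task_count < 0:
        raise ValueError("task_count cannot be negative")

    groups = [
        # Exactly one master node coordinates the cluster.
        {"Name": "master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes run tasks and store HDFS data.
        {"Name": "core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes are optional and only add compute capacity.
        groups.append({"Name": "task", "InstanceRole": "TASK",
                       "InstanceType": instance_type,
                       "InstanceCount": task_count})
    return groups


if __name__ == "__main__":
    for group in build_instance_groups(core_count=2, task_count=4):
        print(group["InstanceRole"], group["InstanceCount"])
```

The resulting list could be passed as part of the `Instances` argument to `boto3.client("emr").run_job_flow(...)`, together with the other arguments that call requires, which are omitted here.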

According to the AWS documentation [2]:

Task nodes are optional. You can use them to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors.

Task nodes don't run the DataNode daemon, nor do they store data in HDFS.

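Some use cases are:

  • You can use task nodes to process data streamed from S3. In that case network I/O does not increase, because the data being processed is not on HDFS.
  • Task nodes can be added or removed freely because no HDFS daemons run on them, so they hold no data. Core nodes do run HDFS daemons, and constantly adding and removing them is not good practice.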
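The second use case can be pictured with a toy model. This is not how EMR or HDFS is implemented; the `Node` class, the `remove_node` function, and the block IDs are all hypothetical. It only shows that removing a node which stores no HDFS blocks leaves nothing to re-replicate, while removing a core node does.

```python
class Node:
    """Toy cluster node; role is "CORE" or "TASK" (illustrative only)."""

    def __init__(self, role):
        self.role = role
        self.hdfs_blocks = set()

    def store_block(self, block_id):
        # Only core nodes run the DataNode daemon in this model.
        if self.role != "CORE":
            raise ValueError("task nodes hold no HDFS blocks")
        self.hdfs_blocks.add(block_id)


def remove_node(cluster, name):
    """Remove a node and return the HDFS blocks that lost a replica."""
    node = cluster.pop(name)
    return set(node.hdfs_blocks)


if __name__ == "__main__":
    cluster = {
        "core-1": Node("CORE"),
        "core-2": Node("CORE"),
        "task-1": Node("TASK"),
    }
    cluster["core-1"].store_block("blk_1")
    cluster["core-1"].store_block("blk_2")
    cluster["core-2"].store_block("blk_1")

    # Removing the task node affects no HDFS data.
    print(sorted(remove_node(cluster, "task-1")))  # prints []
    # Removing a core node leaves blocks that need a new replica.
    print(sorted(remove_node(cluster, "core-1")))  # prints ['blk_1', 'blk_2']
```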

Resources:

[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters

[2] (link not included in the source)
