How YARN knows data locality in Apache Spark in cluster mode


Question


Assume there is a Spark job that is going to read a file named records.txt from HDFS and do some transformations and one action (write the processed output into HDFS). The job will be submitted in YARN cluster mode.

Assume also that records.txt is a 128 MB file and that one of its replicated HDFS blocks is located on NODE 1.

Let's say YARN allocates an executor inside NODE 1.

How does YARN allocate an executor on exactly the node where the input data is located?

Who tells YARN that one of the replicated HDFS blocks of records.txt is available on NODE 1?

How is data locality found by the Spark application? Is it done by the driver, which runs inside the Application Master?

Does YARN know about data locality?

Solution

The fundamental question here is:

Does YARN know about data locality?

YARN "knows" what application tells it and it understand structure (topology) of the cluster. When application makes a resource request, it can include specific locality constraints, which might, or might not be satisfied, when resources are allocated.

If the constraints cannot be satisfied, YARN (or any other cluster manager) will attempt to provide the best alternative match, based on its knowledge of the cluster topology.
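The fallback described above (prefer the node holding a replica, then a node on the same rack, then anywhere with capacity) can be illustrated with a toy model. This is not YARN's actual scheduler; the function, host names, and topology dictionary are all hypothetical:

```python
# Toy model of locality-aware container allocation. Real YARN schedulers
# (Capacity/Fair) also apply delay scheduling and queue policies; this
# only shows the node-local -> rack-local -> off-rack fallback order.

def allocate(preferred_hosts, free_slots, topology):
    """Pick a host for one container.

    preferred_hosts: hosts holding a replica of the data (the constraint).
    free_slots: hosts that currently have spare capacity.
    topology: host -> rack mapping, the cluster structure YARN knows about.
    """
    # 1. Node-local: a preferred host that has free capacity.
    for host in preferred_hosts:
        if host in free_slots:
            return host
    # 2. Rack-local: any free host on the same rack as a replica.
    preferred_racks = {topology[h] for h in preferred_hosts}
    for host in free_slots:
        if topology[host] in preferred_racks:
            return host
    # 3. Off-rack: any free host at all (or None if the cluster is full).
    return next(iter(free_slots), None)

topology = {"node1": "rack1", "node2": "rack1", "node3": "rack2"}
# The block's replica lives on node1; node1 is busy, node2 is free:
print(allocate(["node1"], {"node2", "node3"}, topology))  # rack-local: node2
```

If node1 itself had free capacity, the same call would return node1, the node-local match.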

So how does the application "know"?

If the application uses an input source (a file system or otherwise) that supports some form of data locality, it can query the corresponding catalog (the namenode, in the case of HDFS) to get the locations of the blocks of data it wants to access.
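Conceptually, the catalog lookup is just a mapping from a file's blocks to the hosts holding their replicas. The dictionary below is a hypothetical stand-in for the namenode's metadata, not the HDFS client API:

```python
# Stand-in for namenode metadata: each block of a file maps to the hosts
# holding its replicas. In real HDFS this lookup goes through the client
# API (e.g. asking the namenode for block locations); this dict and the
# helper name are illustrative only.
block_locations = {
    ("records.txt", 0): ["node1", "node4", "node7"],  # block 0, 3 replicas
}

def preferred_hosts(path, block_index):
    """What the application asks the catalog: where does this block live?"""
    return block_locations.get((path, block_index), [])

print(preferred_hosts("records.txt", 0))  # ['node1', 'node4', 'node7']
```

For the 128 MB records.txt in the question, a single 128 MB block means a single such lookup, returning the replica hosts (NODE 1 among them) that the application can then pass to YARN as locality preferences.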

In a broader sense, a Spark RDD can define preferredLocations, depending on the specific RDD implementation, which can later be translated into resource constraints for the cluster manager (not necessarily YARN).
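A minimal sketch of that idea: each partition carries a hint about where its data lives, and the scheduler turns those hints into resource requests. The class and method names below are illustrative, not Spark's actual API (in Spark, this role is played by `RDD.getPreferredLocations(split)` on the JVM side):

```python
# Sketch of the preferredLocations idea. For an HDFS-backed RDD, each
# partition corresponds to a block, and its preferred locations are the
# hosts holding that block's replicas. Names here are hypothetical.

class Partition:
    def __init__(self, index, hosts):
        self.index = index
        self.hosts = hosts  # replica locations for this partition's data

class FileBackedRDD:
    def __init__(self, partitions):
        self.partitions = partitions

    def preferred_locations(self, partition):
        # The locality hint the scheduler consults for this partition.
        return partition.hosts

rdd = FileBackedRDD([Partition(0, ["node1", "node4"])])
# The scheduler can turn these hints into per-container locality requests:
requests = [(p.index, rdd.preferred_locations(p)) for p in rdd.partitions]
print(requests)  # [(0, ['node1', 'node4'])]
```

This is how the chain in the question closes: the driver (running in the Application Master in cluster mode) reads these per-partition hints and includes them in its container requests to YARN, which is how YARN ends up allocating an executor on NODE 1 when capacity allows.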
