Does Spark on YARN deal with data locality while launching executors?


Question

I am considering static allocation of Spark executors. Does Spark on YARN consider the data locality of the raw input datasets used in a Spark application when launching executors?

If it does take care of this, how does it do so, given that executors are requested and allocated when the Spark context is initialized? A Spark application may use multiple raw input datasets that physically reside on many different data nodes, and we cannot run executors on all of those nodes.
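For context, static allocation means the executor count is fixed at submit time, before Spark has read any input. A minimal spark-submit invocation might look like this (the application file, cluster, and resource numbers are placeholders, not taken from the question):

```shell
# Static allocation: the number of executors is decided here,
# before Spark knows anything about the input data's location.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.dynamicAllocation.enabled=false \
  my_app.py
```

YARN then places those 10 containers wherever it has capacity, subject to its own scheduling policy.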

I understand that Spark takes care of data locality when scheduling tasks on executors (as mentioned in https://spark.apache.org/docs/latest/tuning.html#data-locality).
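That task-level behavior is tunable. The linked tuning page describes the `spark.locality.wait` family of properties, which control how long Spark waits for a slot at each locality level before falling back to a less local one. A spark-defaults.conf fragment (values shown are Spark's defaults):

```
spark.locality.wait           3s   # fallback timeout between locality levels
spark.locality.wait.process   3s   # wait for a PROCESS_LOCAL slot
spark.locality.wait.node      3s   # wait for a NODE_LOCAL slot
spark.locality.wait.rack      3s   # wait for a RACK_LOCAL slot
```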

Answer

You are right that

Spark takes care of data locality while scheduling tasks on executors
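To make the idea concrete, here is a small illustration of locality-preferring assignment. This is not Spark's actual scheduler code (Spark's `TaskSetManager` is far more involved); it is just a sketch, with hypothetical hosts and executors, of preferring more-local executors in the order PROCESS_LOCAL > NODE_LOCAL > RACK_LOCAL > ANY:

```python
# Illustration only, not Spark's real scheduler: pick the most
# local available executor for a task, given the hosts that hold
# the task's data (e.g. the HDFS block replica locations).

LOCALITY_LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

def locality_level(preferred_hosts, executor_host, executor_rack, host_to_rack):
    """Classify how local an executor is for a task's preferred hosts."""
    if executor_host in preferred_hosts:
        return "NODE_LOCAL"
    if any(host_to_rack.get(h) == executor_rack for h in preferred_hosts):
        return "RACK_LOCAL"
    return "ANY"

def best_executor(preferred_hosts, executors, host_to_rack):
    """Return the executor with the best (lowest-index) locality level."""
    return min(
        executors,
        key=lambda e: LOCALITY_LEVELS.index(
            locality_level(preferred_hosts, e["host"], e["rack"], host_to_rack)
        ),
    )

# Hypothetical topology: two racks, three data nodes.
host_to_rack = {"dn1": "r1", "dn2": "r1", "dn3": "r2"}
executors = [
    {"id": "exec-1", "host": "dn2", "rack": "r1"},
    {"id": "exec-2", "host": "dn3", "rack": "r2"},
]

# A task whose block lives on dn1: no executor runs on dn1, but
# exec-1 shares rack r1 with dn1, so it wins as RACK_LOCAL.
print(best_executor({"dn1"}, executors, host_to_rack)["id"])  # exec-1
```

The point of the sketch: even when no executor sits on the exact node holding the data, the scheduler degrades gracefully to rack-level and then arbitrary placement rather than stalling.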

When YARN launches an executor, it has no idea where your data is. So, in an ideal case, you would launch executors on all nodes of your cluster. More realistically, however, you launch them on only a subset of nodes.

Now, this is not necessarily a bad thing, because HDFS inherently supports redundancy, which means chances are there is a copy of the data present on the node where Spark requests it.
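You can see this redundancy directly: HDFS will report where each block's replicas live (the path below is a placeholder):

```
# List block replica locations for a file; with the default
# replication factor of 3, each block has copies on three data nodes.
hdfs fsck /data/input.csv -files -blocks -locations
```

The more replicas a block has, the better the odds that some executor already sits on (or near) a node holding it.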

