How to know which worker a partition is executed at?
Question
I'm trying to find a way to get the locality of an RDD's partitions in Spark.
After calling RDD.repartition() or PairRDD.combineByKey(), the returned RDD is partitioned. I'd like to know which worker instances the partitions end up on (in order to examine the partitioning behaviour).
Can anyone provide a clue?
Answer
An interesting question that, I'm sure, has a not-so-interesting answer :)
First of all, applying transformations to your RDD has nothing to do with worker instances, since they are separate "entities". Transformations build an RDD lineage (a logical plan), while executors only come on stage (no pun intended) after an action is executed, at which point the DAGScheduler turns the logical plan into an execution plan: a set of stages made up of tasks.
So I believe the only way to know which executor a partition is executed on is to use org.apache.spark.SparkEnv to access the BlockManager that corresponds to a single executor. That's exactly how Spark knows about and tracks executors (by their BlockManagers).
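A minimal sketch of that SparkEnv approach (not from the original answer): inside a task, the executor-local SparkEnv exposes the BlockManager, whose BlockManagerId carries the executor id and host. Running it in local mode, as below, every partition reports the single "driver" executor; on a real cluster each partition reports the executor it actually ran on.

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

object PartitionLocality {
  def main(args: Array[String]): Unit = {
    // Local mode for illustration; on a cluster, use your real master URL.
    val conf = new SparkConf().setAppName("partition-locality").setMaster("local[2]")
    val sc   = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // Each task reads the executor-local BlockManagerId and reports
    // (partition index, executor id, host) for its partition.
    val locations = rdd.mapPartitionsWithIndex { (idx, _) =>
      val bmId = SparkEnv.get.blockManager.blockManagerId
      Iterator((idx, bmId.executorId, bmId.host))
    }.collect()

    locations.foreach { case (idx, exec, host) =>
      println(s"partition $idx -> executor $exec on $host")
    }
    sc.stop()
  }
}
```

Note that mapPartitionsWithIndex runs a fresh job, so the mapping reflects where the partitions were computed for this action; Spark does not pin partitions to workers between jobs unless the RDD is cached.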
You could also write an org.apache.spark.scheduler.SparkListener that intercepts onExecutorAdded, onBlockManagerAdded and their *Removed counterparts to learn how executors map to BlockManagers (but I believe SparkEnv is enough).
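The listener idea can be sketched roughly as follows (my own illustration, not from the original answer); the class name ExecutorTracker is made up, and you would register it with sc.addSparkListener before running any jobs:

```scala
import org.apache.spark.scheduler.{
  SparkListener,
  SparkListenerBlockManagerAdded, SparkListenerBlockManagerRemoved,
  SparkListenerExecutorAdded, SparkListenerExecutorRemoved
}
import scala.collection.mutable

// Tracks executors (and their BlockManagers) as they come and go.
class ExecutorTracker extends SparkListener {
  // executorId -> host, kept up to date by the events below
  val executors = mutable.Map.empty[String, String]

  override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
    executors(e.executorId) = e.executorInfo.executorHost

  override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit =
    executors -= e.executorId

  override def onBlockManagerAdded(e: SparkListenerBlockManagerAdded): Unit =
    println(s"BlockManager added: ${e.blockManagerId}")

  override def onBlockManagerRemoved(e: SparkListenerBlockManagerRemoved): Unit =
    println(s"BlockManager removed: ${e.blockManagerId}")
}

// Usage (before submitting jobs):
//   sc.addSparkListener(new ExecutorTracker)
```

This tells you which executors and BlockManagers exist, but not which partition ran where; for that, combine it with per-task information such as the SparkEnv approach above.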