什么时候创建RDD血统?如何找到谱系图? [英] When does a RDD lineage is created? How to find lineage graph?

查看:56
本文介绍了什么时候创建RDD血统?如何找到谱系图?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在学习Apache Spark,并尝试获取RDD的沿袭图. 但是我找不到何时创建特定血统? 另外,在哪里可以找到RDD的血统?

I am learning Apache Spark and trying to get the lineage graph of the RDDs. But i could not find when does a particular lineage is created? Also, where to find the lineage of an RDD?

推荐答案

RDD沿袭是每次对应用转换时都会创建和扩展的分布式计算的逻辑执行计划任何 RDD.

请注意逻辑"部分不是物理"的执行动作后就会发生这种情况.

Note the part "logical" not "physical" that happens after you've executed an action.

引用掌握Apache Spark 2 gitbook:

Quoting Mastering Apache Spark 2 gitbook:

RDD谱系(又名 RDD运算符图 RDD依赖图)是RDD的所有父RDD的图.它是通过对RDD进行转换而构建的,并创建了一个逻辑执行计划.

RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of a RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan.

因此,RDD谱系图是调用动作后需要执行哪些转换的图.

A RDD lineage graph is hence a graph of what transformations need to be executed after an action has been called.

任何RDD都有RDD血统,即使这意味着RDD血统只是单个节点,即RDD本身.那是因为RDD可能是也可能不是一系列转换的结果(而且没有转换是零效应"转换:))

Any RDD has a RDD lineage even if that means that the RDD lineage is just a single node, i.e. the RDD itself. That's because an RDD may or may not be a result of a series of transformations (and no transformations is a "zero-effect" transformation :))

您可以使用

toDebugString:字符串对此RDD及其用于调试的递归依赖性的描述.

toDebugString: String A description of this RDD and its recursive dependencies for debugging.

val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[0] at parallelize at <console>:24 []

val doubles = nums.map(_ * 2)
scala> doubles.toDebugString
res1: String =
(8) MapPartitionsRDD[1] at map at <console>:25 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

val groups = doubles.groupBy(_ < 10)
scala> groups.toDebugString
res2: String =
(8) ShuffledRDD[3] at groupBy at <console>:25 []
 +-(8) MapPartitionsRDD[2] at groupBy at <console>:25 []
    |  MapPartitionsRDD[1] at map at <console>:25 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

这篇关于什么时候创建RDD血统?如何找到谱系图?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆