What is RDD dependency in Spark?

Question

As far as I know, there are two types of dependencies: narrow and wide. But I don't understand how a dependency affects the child RDD. Is the child RDD only metadata that contains information on how to build new RDD blocks from the parent RDD? Or is the child RDD a self-sufficient set of data created from the parent RDD?

Answer

Yes, the child RDD is metadata that describes how to calculate the RDD from the parent RDD.

Consider MappedRDD, for example:

private[spark]
class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {

  // The child RDD reuses the parent's partitions; no data is materialized here.
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // Computation is defined lazily: when a partition is requested, map f
  // over the parent's iterator for that partition.
  override def compute(split: Partition, context: TaskContext) =
    firstParent[T].iterator(split, context).map(f)
}

When you say rdd2 = rdd1.map(...), rdd2 will be such a MappedRDD. compute is only executed later, for example when you call rdd2.collect.
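
A minimal sketch of this laziness, assuming a spark-shell session where sc is the usual SparkContext:

val rdd1 = sc.parallelize(1 to 5)

// Building rdd2 only records the dependency on rdd1 and the function;
// no partition is computed at this point.
val rdd2 = rdd1.map(_ * 2)

// collect is an action: only now does Spark invoke compute on each partition.
println(rdd2.collect().mkString(", "))  // prints: 2, 4, 6, 8, 10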

An RDD is always such metadata, even if it has no parents (for example sc.textFile(...)). The only case in which an RDD is stored on the nodes is if you mark it for caching with rdd.cache, and then cause it to be computed.
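
A small sketch of the caching behavior in the same spark-shell setting (the input path is hypothetical):

// "input.txt" is a hypothetical file used only for illustration.
val upper = sc.textFile("input.txt").map(_.toUpperCase)

upper.cache()  // only marks the RDD for caching; nothing is stored yet
upper.count()  // first action: partitions are computed and kept in memory
upper.count()  // second action: served from the cache; map is not re-run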

Another similar situation is calling rdd.checkpoint. This function marks the RDD for checkpointing. The next time it is computed, it will be written to disk, and later access to the RDD will cause it to be read from disk instead of recalculated.

The difference between cache and checkpoint is that a cached RDD still retains its dependencies. The cached data can be discarded under memory pressure, and may need to be recalculated in part or whole. This cannot happen with a checkpointed RDD, so the dependencies are discarded there.
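
A sketch of checkpointing, again assuming a spark-shell session (the checkpoint directory is a hypothetical local path; on a real cluster it would usually live on a fault-tolerant filesystem such as HDFS):

sc.setCheckpointDir("/tmp/spark-checkpoints")  // hypothetical directory

val raw     = sc.parallelize(1 to 100)
val derived = raw.map(_ + 1)

derived.checkpoint()  // marks the RDD for checkpointing; nothing is written yet
derived.count()       // triggers computation and writes the result to the checkpoint dir

// After the checkpoint the lineage is truncated: derived no longer lists raw
// among its dependencies, and later actions read the saved data from disk.
println(derived.toDebugString)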
