When to persist and when to unpersist RDD in Spark
Problem description
Let's say I have the following:
val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)
val dataset3 = dataset2.map(.....)
If you then do a transformation on dataset2 to produce dataset3, do you have to persist dataset3 as well and unpersist dataset2, or not?
I am trying to figure out when to persist and when to unpersist RDDs. Do I have to persist every new RDD that is created?
Thanks
Recommended answer
Spark automatically monitors cache usage on each node and drops old data partitions in a least-recently-used (LRU) fashion. If you would like to remove an RDD manually instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
Quoted from: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence