What happens if I cache the same RDD twice in Spark
Question
I'm building a generic function which receives an RDD and does some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For example:
public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache();
    JavaRDD<String> t1 = r... // Some calculations
    JavaRDD<String> t2 = r... // Other calculations
    return t1.union(t2);
}
My question is: since r is given to me, it may or may not already be cached. If it is cached and I call cache on it again, will Spark create a new layer of cache, meaning that while t1 and t2 are calculated I will have two instances of r in the cache? Or will Spark be aware of the fact that r is already cached and ignore the second call?
Answer
Nothing. If you call cache on a cached RDD, nothing happens; the RDD will be cached (once). Caching, like many other transformations, is lazy:
- When you call cache, the RDD's storageLevel is set to MEMORY_ONLY.
- When you call cache again, it's set to the same value (no change).
- Upon evaluation, when the underlying RDD is materialized, Spark checks the RDD's storageLevel and, if it requires caching, caches it.
So you are safe.
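The three steps above can be sketched with a toy model. Note this is NOT Spark's real implementation — ToyRDD, blockStore, and materialize() are invented names used only to illustrate why a repeated cache() call cannot produce a second cached copy: cache() merely records the desired storage level, and the single copy is created lazily at evaluation time.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for an RDD's storage-level bookkeeping (illustrative only).
class ToyRDD {
    // Shared "block store": at most one cached copy per RDD instance.
    static final Map<ToyRDD, String> blockStore = new HashMap<>();

    private String storageLevel = "NONE";

    ToyRDD cache() {
        // Lazy and idempotent: repeated calls just (re)assign the same level.
        storageLevel = "MEMORY_ONLY";
        return this;
    }

    int materialize() {
        // On evaluation, cache the data once if the level requires it.
        if (!storageLevel.equals("NONE") && !blockStore.containsKey(this)) {
            blockStore.put(this, "data"); // a single cached copy
        }
        return blockStore.size();         // total cached copies
    }
}

public class CacheTwiceDemo {
    public static void main(String[] args) {
        ToyRDD r = new ToyRDD();
        r.cache();
        r.cache(); // second call changes nothing
        System.out.println(r.materialize()); // prints 1
        System.out.println(r.materialize()); // still 1: no second copy
    }
}
```

One related detail worth knowing: in real Spark, persist() with a *different* StorageLevel on an already-persisted RDD raises an error, whereas repeating cache() (same level) is harmless.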