What happens if I cache the same RDD twice in Spark?

Question

I'm building a generic function which receives an RDD and does some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For example:

public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache(); // r may or may not already be cached by the caller
    JavaRDD<String> t1 = r... // Some calculations
    JavaRDD<String> t2 = r... // Other calculations
    return t1.union(t2);
}

My question is: since r is given to me, it may or may not already be cached. If it is cached and I call cache on it again, will Spark create a new layer of cache, meaning that while t1 and t2 are calculated I will have two instances of r in the cache? Or is Spark aware that r is already cached, and will it ignore the second call?

Recommended answer

Nothing. If you call cache on an RDD that is already cached, nothing happens; the RDD will be cached (once). Caching, like transformations, is lazy:


  • When you call cache, the RDD's storageLevel is set to MEMORY_ONLY.
  • When you call cache again, it is set to the same value (no change).
  • Upon evaluation, when the underlying RDD is materialized, Spark checks the RDD's storageLevel and, if it requires caching, caches it.
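
To make the three steps above concrete, here is a toy model in plain Java. It is not Spark code: LazyDataset, Level, collect() and computeCount() are hypothetical names made up for illustration, and the real RDD internals are more involved. The model also mirrors one caveat that, as far as I know, applies to real RDDs: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), and asking for a different level on an RDD that already has one throws UnsupportedOperationException, so "nothing happens" only holds when the level is the same.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy stand-in for an RDD. NOT Spark code; every name here is hypothetical.
class LazyDataset<T> {
    enum Level { NONE, MEMORY_ONLY, MEMORY_AND_DISK }

    private final Supplier<List<T>> source; // the (expensive) computation
    private Level level = Level.NONE;       // requested storage level
    private List<T> cached;                 // filled on first evaluation, if a level was requested
    private int computeCount = 0;           // how many times the source actually ran

    LazyDataset(Supplier<List<T>> source) {
        this.source = source;
    }

    // Only records the requested level; computes nothing.
    LazyDataset<T> cache() {
        return persist(Level.MEMORY_ONLY);
    }

    LazyDataset<T> persist(Level newLevel) {
        // Same level again is a no-op; a different level is rejected.
        if (level != Level.NONE && level != newLevel) {
            throw new UnsupportedOperationException("storage level already assigned: " + level);
        }
        level = newLevel;
        return this;
    }

    // An "action": materializes the data and keeps it if a level was requested.
    List<T> collect() {
        if (cached != null) {
            return cached;
        }
        computeCount++;
        List<T> data = source.get();
        if (level != Level.NONE) {
            cached = data;
        }
        return data;
    }

    Level level() {
        return level;
    }

    int computeCount() {
        return computeCount;
    }
}

class CacheTwiceDemo {
    public static void main(String[] args) {
        LazyDataset<String> r = new LazyDataset<>(() -> Arrays.asList("a", "b", "c"));
        r.cache();
        r.cache();                            // second call: same level, no change
        System.out.println(r.computeCount()); // prints 0: nothing has been computed yet
        r.collect();
        r.collect();
        System.out.println(r.computeCount()); // prints 1: the second collect hit the single cached copy
    }
}
```

In real Spark code, r.getStorageLevel() on a JavaRDD tells you whether a level was already assigned, which can be checked before calling cache() if the caller might have persisted r at a level other than MEMORY_ONLY.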

So you are safe.
