Understanding Spark's caching


Question

I'm trying to understand how Spark's caching works.

Here is my naive understanding; please let me know if I'm missing something:

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")

In the above, rdd1 will be loaded from disk (e.g. HDFS) only once (when rdd2 is saved, I assume), and then served from the cache (assuming there is enough RAM) when rdd3 is saved.
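One rough way to see this in practice (a sketch under assumptions: a live SparkContext named sc, illustrative filter/map functions, and enough memory for the cache) is to print rdd1's lineage after the first save; once the cached partitions have been materialized, toDebugString should report them, and the second save should then be served from memory:

val rdd1 = sc.textFile("some data")
rdd1.cache()                              // mark rdd1 for caching
val rdd2 = rdd1.filter(_.nonEmpty)        // illustrative predicate
val rdd3 = rdd1.map(_.toUpperCase)        // illustrative mapping

rdd2.saveAsTextFile("out2")               // first action: reads HDFS and fills the cache
println(rdd1.toDebugString)               // lineage should now mention the cached partitions
rdd3.saveAsTextFile("out3")               // second action: reads rdd1 from the cache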

Now here is my question. Let's say I want to cache rdd2 and rdd3, as they will both be used later on, but I don't need rdd1 after creating them.

Basically there is duplication, isn't there? Since once rdd2 and rdd3 are computed I don't need rdd1 anymore, I should probably unpersist it, right? The question is when.

Will this work (Option A)?

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd1.unpersist()

Does Spark add the unpersist call to the DAG, or is it executed immediately? If it is executed immediately, then rdd1 will effectively not be cached when I read from rdd2 and rdd3, right?

Or should I do it this way instead (Option B)?

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)

rdd2.cache()
rdd3.cache()

rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")

rdd1.unpersist()

So the question is this: is Option A good enough? E.g. will rdd1 still access the file only once? Or do I need to go with Option B?

Answer

It would seem that Option B is required. The reason relates to how persist/cache and unpersist are executed by Spark. Since RDD transformations merely build DAG descriptions without executing anything, in Option A, by the time you call unpersist you still only have job descriptions, not a running execution.
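A minimal laziness check (illustrative names, run against a local SparkContext named sc) shows why: the side effect inside map only fires when an action runs, so in Option A the unpersist call happens before anything was ever materialized into the cache.

val nums   = sc.parallelize(1 to 3)
val traced = nums.map { n => println(s"computing $n"); n * 2 }
// nothing has been printed yet -- map only extended the DAG description
traced.count()                            // action: the map closure actually executes now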

This matters because a cache or persist call just adds the RDD to a map of RDDs that have marked themselves to be persisted during job execution. unpersist, however, directly tells the blockManager to evict the RDD from storage and removes the reference from the map of persistent RDDs.
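That bookkeeping can be observed from user code. The sketch below assumes a live SparkContext named sc and uses getPersistentRDDs, which exposes the map of persisted RDDs referred to above:

val rdd1 = sc.textFile("some data").cache()
println(sc.getPersistentRDDs.keys)        // rdd1's id is listed: it is marked for caching
rdd1.count()                              // action: the blocks are now actually stored
rdd1.unpersist()                          // evicts the blocks and drops the map entry right away
println(sc.getPersistentRDDs.keys)        // rdd1's id is gone
println(rdd1.getStorageLevel)             // reports that rdd1 is no longer persisted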

persist function: https://github.com/apache/spark/blob/b0d884f044fea1c954da77073f3556cd9ab1e922/core/src/main/scala/org/apache/spark/SparkContext.scala#L1306

unpersist function: https://github.com/apache/spark/blob/b0d884f044fea1c954da77073f3556cd9ab1e922/core/src/main/scala/org/apache/spark/SparkContext.scala#L1313

So you would need to call unpersist after Spark has actually executed the job and stored the RDD with the block manager.

The comments on the RDD.persist method hint at this: https://github.com/apache/spark/blob/b0d884f044fea1c954da77073f3556cd9ab1e922/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L156

