What is the purpose of caching an RDD in Apache Spark?


Problem Description

I am new to Apache Spark and I have a couple of basic questions that I could not resolve while reading the Spark material; every resource has its own style of explanation. I am practicing with a PySpark Jupyter notebook on Ubuntu.

As I understand it, when I run the command below, the data in testfile.csv is partitioned and stored in the memory of the respective nodes. (I do know that evaluation is lazy and nothing is processed until Spark sees an action command, but the concept still stands):

rdd1 = sc.textFile("testfile.csv")

My question is: when I run the transformation and action commands below, where is the rdd2 data stored?

1. Is it stored in memory?

rdd2 = rdd1.map( lambda x: x.split(",") )

rdd2.count()

I know the data in rdd2 will be available until I close the Jupyter notebook. Then what is the need for cache(), since rdd2 is available for all transformations anyway? I have heard that after all the transformations the data in memory is cleared; what is that about?

2. Is there any difference between keeping an RDD in memory and cache()?

rdd2.cache()

Recommended Answer

Does it store in memory?

When you run a Spark transformation via an action (count, print, foreach), then, and only then, is your graph materialized; in your case that is when the file is consumed. The purpose of RDD.cache is to make sure that the result of sc.textFile("testfile.csv") is available in memory and doesn't need to be read again.
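To illustrate that laziness, here is a minimal PySpark sketch (the variable names mirror the question; the comments describe what each step does):

rdd1 = sc.textFile("testfile.csv")        # lazy: only records the lineage, reads nothing
rdd2 = rdd1.map(lambda x: x.split(","))   # still lazy: extends the lineage graph
rdd2.count()                              # action: only now is the file read and the graph executed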

Don't confuse the variable with the actual operations being done behind the scenes. Caching lets you re-iterate over the data and makes sure it stays in memory (if there is sufficient memory to store it in its entirety), as long as you have set the right storage level (cache() defaults to StorageLevel.MEMORY_ONLY). From the documentation (thanks @RockieYang):

In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
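As a concrete sketch of the two APIs in PySpark (MEMORY_AND_DISK below is just one example level, not something the question requires):

from pyspark import StorageLevel

rdd2 = rdd1.map(lambda x: x.split(","))

# cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
rdd2.cache()

# persist() lets you choose the level explicitly, e.g. spilling to disk
# whatever doesn't fit in memory (an RDD can only hold one level at a time):
# rdd2.persist(StorageLevel.MEMORY_AND_DISK)

rdd2.count()   # first action: computes rdd2 and stores its partitions
rdd2.count()   # later actions reuse the stored partitions instead of re-reading the file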


Is there any difference between keeping an RDD in memory and cache()?

As stated above, cache() is how you keep it in memory, as long as you have provided the right storage level. Otherwise, it won't necessarily be kept in memory at the time you want to re-use it.
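To make the difference observable, a small sketch under the assumption that the file is big enough for the read to dominate (the timing code is illustrative, not part of the original answer):

import time

rdd2 = rdd1.map(lambda x: x.split(","))

t0 = time.time()
rdd2.count()                  # uncached: reads the file and applies the map
print("first count:", time.time() - t0)

rdd2.cache()
rdd2.count()                  # this action materializes the cache
t1 = time.time()
rdd2.count()                  # served from memory; typically much faster
print("cached count:", time.time() - t1)

rdd2.unpersist()              # explicitly release the cached partitions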
