Spark's Dataset unpersist behaviour


Problem description

Recently I saw some strange behaviour of Spark.

I have a pipeline in my application in which I'm manipulating one big Dataset - pseudocode:

val data = spark.read(...)
  .join(df1, "key") // etc., more transformations
data.cache()            // used so data is not recalculated after the save
data.write.parquet(...) // some save

val extension = data.join(...) // more transformations - joins, selects, etc.
extension.cache()       // again, cache to avoid double calculations
extension.count()
// (1)
extension.write.csv(...) // some other save

extension.groupBy("key").agg(...) // some aggregations
  .write.parquet(...)   // other save; without the cache this would trigger recomputation of the whole dataset

However, when I call data.unpersist(), i.e. at place (1), Spark removes all datasets from storage, including the extension Dataset, which is not the dataset I tried to unpersist.

Is that expected behaviour? How can I free some memory by calling unpersist on an old Dataset without unpersisting every Dataset that was "next in the chain"?

My setup:

  • Spark version: current master, RC for 2.3
  • Scala: 2.11
  • Java: OpenJDK 1.8

The question looks similar to Understanding Spark's caching, but here I'm performing some actions before unpersisting. First I count everything and then save it into storage - I don't know whether caching works the same way for RDDs as it does for Datasets.
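As an illustration of the RDD side of that uncertainty, here is a minimal spark-shell style sketch (names are made up): with plain RDDs, unpersisting a parent does not drop a cached child, because RDD caching stores materialised partitions rather than query plans.

val base = spark.sparkContext.parallelize(1 to 100).cache()
base.count()                      // materialise base's cached blocks

val derived = base.map(_ * 2).cache()
derived.count()                   // materialise derived's cached blocks

base.unpersist()                  // removes only base's blocks from storage
println(derived.getStorageLevel)  // still MEMORY_ONLY - no cascade for plain RDDs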

Recommended answer

This is expected behavior of Spark caching. Spark doesn't want to keep invalid cached data, so it completely removes all cached plans that refer to the dataset.

This is to make sure the query is correct. In the example, you are creating the extension dataset from the cached dataset data. If data is unpersisted, the extension dataset can no longer rely on the cached dataset data.
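A minimal sketch of how this shows up (spark-shell style, hypothetical column names, assuming a build with the cascading behaviour described here): after data.unpersist(), the dependent extension entry disappears from storage as well.

import spark.implicits._

val data = spark.range(100).toDF("key").cache()
data.count()

val extension = data.withColumn("doubled", $"key" * 2).cache()
extension.count()
println(extension.storageLevel)   // cached, e.g. StorageLevel(disk, memory, deserialized, 1 replicas)

data.unpersist()                  // unpersist only the parent dataset
println(extension.storageLevel)   // StorageLevel.NONE - the dependent cached plan was dropped too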

Here is the pull request for the fix they made. You can also see the similar JIRA ticket.
