Spark Dataset unpersist behaviour


Problem Description

Recently I saw some strange behaviour of Spark.

I have a pipeline in my application in which I'm manipulating one big Dataset - pseudocode:

val data = spark.read(...)
  .join(df1, "key") // etc., more transformations
data.cache() // cached so data is not recalculated after the save
data.write.parquet(...) // some save

val extension = data.join(...) // more transformations - joins, selects, etc.
extension.cache() // again, cached to avoid recomputing
extension.count()
// (1)
extension.write.csv(...) // some other save

val aggregated = extension.groupBy("key").agg(...) // some aggregations
aggregated.write.parquet(...) // other save; without the cache this would trigger recomputation of the whole dataset

However, when I call data.unpersist(), i.e. at place (1), Spark deletes all Datasets from Storage, including the extension Dataset, which is not the Dataset I tried to unpersist.

Is this expected behaviour? How can I free some memory by unpersisting an old Dataset without also unpersisting every Dataset that was "next in the chain"?

My setup:

  • Spark version: current master, RC for 2.3
  • Scala: 2.11
  • Java: OpenJDK 1.8

The question looks similar to Understanding Spark's caching, but here I am performing some actions before the unpersist. First I count everything and then save into storage - I don't know whether caching works the same way for RDDs as it does for Datasets.

Recommended Answer

This is expected behaviour from Spark caching. Spark doesn't want to keep invalid cached data, so it completely removes all cached plans that refer to the dataset.

This is to make sure the query stays correct. In the example you are creating the extension Dataset from the cached Dataset data. If data is unpersisted, the extension Dataset can no longer rely on the cached Dataset data.
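
As a minimal sketch of that cascading behaviour (the names base and derived are illustrative, not from the question, and the comments assume the Spark 2.3 behaviour described here), Dataset.storageLevel can be used to observe that unpersisting the parent also evicts the derived cached Dataset:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val base = spark.range(100).toDF("key").cache()   // parent Dataset, cached
val derived = base.filter($"key" > 50).cache()    // derived from the cached parent
derived.count()                                   // materializes both caches

println(derived.storageLevel)  // MEMORY_AND_DISK while cached
base.unpersist()               // cascades to cached plans that depend on `base`
println(derived.storageLevel)  // NONE - `derived` was evicted as well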

Here is the pull request for the fix they made. You can also see the related JIRA ticket.
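
One possible workaround for the original question (this is only a sketch and not part of the answer above; the path and the choice of Parquet are assumptions): materialize the derived Dataset and read it back, so that its plan no longer references data, then unpersist data without losing the new cache:

// Write the derived Dataset out and read it back; the re-read plan
// does not depend on `data`, so it survives data.unpersist().
extension.write.parquet("/tmp/extension")                    // path is illustrative
val extensionIndependent = spark.read.parquet("/tmp/extension").cache()
extensionIndependent.count()                                 // fill the new cache

data.unpersist()   // only evicts cached plans that still refer to `data`

// ... continue the pipeline (groupBy, agg, further writes) on extensionIndependent

Alternatively, Dataset.checkpoint (with a checkpoint directory set via spark.sparkContext.setCheckpointDir) can also be used to truncate the lineage so that the derived Dataset's cache no longer depends on data.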
