Spark Dataset unpersist behaviour


Problem Description

Recently I saw some strange behaviour of Spark.

I have a pipeline in my application in which I'm manipulating one big Dataset - pseudocode:

val data = spark.read(...)
  .join(df1, "key") // etc., more transformations
data.cache() // cached so data is not recalculated after the save
data.write.parquet(...) // some save

val extension = data.join(...) // more transformations - joins, selects, etc.
extension.cache() // again, cached to avoid recomputing
extension.count()
// (1)
extension.write.csv(...) // some other save

val aggregated = extension.groupBy("key").agg(...) // some aggregations
aggregated.write.parquet(...) // other save; without the cache this would trigger recomputation of the whole dataset

However, when I call data.unpersist(), i.e. at place (1), Spark deletes all Datasets from Storage, including the extension Dataset, which is not the Dataset I tried to unpersist.

Is this expected behaviour? How can I free some memory by unpersisting an old Dataset without also unpersisting every Dataset that was "next in the chain"?

My setup:

  • Spark version: current master, RC for 2.3
  • Scala: 2.11
  • Java: OpenJDK 1.8

The question looks similar to Understanding Spark's caching, but here I am performing some actions before the unpersist. First I count everything and then save into storage - I don't know whether caching works the same way for RDDs as it does for Datasets.

Recommended Answer

This is expected behaviour from Spark caching. Spark doesn't want to keep invalid cached data, so it completely removes all cached plans that refer to the dataset.

This is to make sure the query stays correct. In the example you are creating the extension Dataset from the cached Dataset data. If data is unpersisted, the extension Dataset can no longer rely on the cached Dataset data.
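
As a minimal sketch of that cascading behaviour (the names base and derived are illustrative, not from the question, and the comments assume the Spark 2.3 behaviour described here), Dataset.storageLevel can be used to observe that unpersisting the parent also evicts the derived cached Dataset:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val base = spark.range(100).toDF("key").cache()   // parent Dataset, cached
val derived = base.filter($"key" > 50).cache()    // derived from the cached parent
derived.count()                                   // materializes both caches

println(derived.storageLevel)  // MEMORY_AND_DISK while cached
base.unpersist()               // cascades to cached plans that depend on `base`
println(derived.storageLevel)  // NONE - `derived` was evicted as well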

Here is the pull request for the fix they made. You can also see the related JIRA ticket.
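
One possible workaround for the original question (this is only a sketch and not part of the answer above; the path and the choice of Parquet are assumptions): materialize the derived Dataset and read it back, so that its plan no longer references data, then unpersist data without losing the new cache:

// Write the derived Dataset out and read it back; the re-read plan
// does not depend on `data`, so it survives data.unpersist().
extension.write.parquet("/tmp/extension")                    // path is illustrative
val extensionIndependent = spark.read.parquet("/tmp/extension").cache()
extensionIndependent.count()                                 // fill the new cache

data.unpersist()   // only evicts cached plans that still refer to `data`

// ... continue the pipeline (groupBy, agg, further writes) on extensionIndependent

Alternatively, Dataset.checkpoint (with a checkpoint directory set via spark.sparkContext.setCheckpointDir) can also be used to truncate the lineage so that the derived Dataset's cache no longer depends on data.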
