Spark's Dataset unpersist behaviour


Problem description

Recently I saw some strange behaviour of Spark.

I have a pipeline in my application in which I'm manipulating one big Dataset - pseudocode:

val data = spark.read(...)
  .join(df1, "key") // etc., more transformations
data.cache()            // used so data is not recalculated after the save
data.write.parquet(...) // some save

val extension = data.join(...) // more transformations - joins, selects, etc.
extension.cache()       // again, cache to avoid double calculations
extension.count()
// (1)
extension.write.csv(...) // some other save

extension.groupBy("key").agg(...) // some aggregations
  .write.parquet(...)   // other save; without the cache this would trigger recomputation of the whole dataset

However, when I call data.unpersist(), i.e. at place (1), Spark removes all datasets from storage, including the extension Dataset, which is not the dataset I tried to unpersist.

Is that expected behaviour? How can I free some memory by calling unpersist on an old Dataset without unpersisting every Dataset that was "next in the chain"?

My setup:

  • Spark version: current master, RC for 2.3
  • Scala: 2.11
  • Java: OpenJDK 1.8

The question looks similar to Understanding Spark's caching, but here I'm performing some actions before unpersisting. First I count everything and then save it into storage - I don't know whether caching works the same way for RDDs as it does for Datasets.
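As an illustration of the RDD side of that uncertainty, here is a minimal spark-shell style sketch (names are made up): with plain RDDs, unpersisting a parent does not drop a cached child, because RDD caching stores materialised partitions rather than query plans.

val base = spark.sparkContext.parallelize(1 to 100).cache()
base.count()                      // materialise base's cached blocks

val derived = base.map(_ * 2).cache()
derived.count()                   // materialise derived's cached blocks

base.unpersist()                  // removes only base's blocks from storage
println(derived.getStorageLevel)  // still MEMORY_ONLY - no cascade for plain RDDs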

Recommended answer

This is expected behavior of Spark caching. Spark doesn't want to keep invalid cached data, so it completely removes all cached plans that refer to the dataset.

This is to make sure the query is correct. In the example, you are creating the extension dataset from the cached dataset data. If data is unpersisted, the extension dataset can no longer rely on the cached dataset data.
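A minimal sketch of how this shows up (spark-shell style, hypothetical column names, assuming a build with the cascading behaviour described here): after data.unpersist(), the dependent extension entry disappears from storage as well.

import spark.implicits._

val data = spark.range(100).toDF("key").cache()
data.count()

val extension = data.withColumn("doubled", $"key" * 2).cache()
extension.count()
println(extension.storageLevel)   // cached, e.g. StorageLevel(disk, memory, deserialized, 1 replicas)

data.unpersist()                  // unpersist only the parent dataset
println(extension.storageLevel)   // StorageLevel.NONE - the dependent cached plan was dropped too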

Here is the pull request for the fix they made. You can also see the similar JIRA ticket.
