How to make sure my DataFrame frees its memory?


Problem description

I have a Spark/Scala job in which I do this:

  • 1: Compute a big DataFrame df1 + cache it into memory
  • 2: Use df1 to compute dfA
  • 3: Read raw data into df2 (again, it's big) + cache it
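The three steps above can be sketched as follows. This is a minimal illustration, not the asker's actual job: the tiny in-memory `Seq`s stand in for the big raw data, and the `filter` stands in for the real computation of dfA.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CachePipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")              // local mode for illustration only
      .appName("cache-pipeline")
      .getOrCreate()
    import spark.implicits._

    // 1: Compute a big DataFrame df1 + cache it into memory.
    val df1: DataFrame = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value").cache()

    // 2: Use df1 to compute dfA; df1 is materialized once and reused from cache.
    val dfA = df1.filter($"id" > 1)
    assert(dfA.count() == 2)

    // 3: Read raw data into df2 (again, it's big) + cache it.
    // At this point df1 is still pinned in the cache, which is the problem below.
    val df2: DataFrame = Seq((4, "d"), (5, "e")).toDF("id", "value").cache()
    assert(df2.count() == 2)

    spark.stop()
  }
}
```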

When performing (3), I no longer need df1. I want to make sure its space gets freed. I cached at (1) because this DataFrame gets used in (2), and it's the only way to make sure I do not recompute it each time but only once.

I need to free its space and make sure it gets freed. What are my options?

I thought of these, but they don't seem to be sufficient:

  • df=null
  • df.unpersist()

Can you document your answer with a proper Spark documentation link?

Answer

df.unpersist should be sufficient, but it won't necessarily free it right away. It merely marks the dataframe for removal.

You can use df.unpersist(blocking = true) which will block until the dataframe is removed before continuing on.
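A small self-contained sketch of the difference (local mode; the tiny DataFrame stands in for the big one, and `storageLevel` is read only to observe whether the cached blocks are gone):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object UnpersistDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("unpersist-demo")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(1, 2, 3).toDF("x").cache()
    df1.count()                                    // materialize the cache
    assert(df1.storageLevel != StorageLevel.NONE)  // blocks are cached

    // blocking = true: waits until the cached blocks are actually deleted
    // before continuing, instead of merely marking them for removal.
    df1.unpersist(blocking = true)
    assert(df1.storageLevel == StorageLevel.NONE)  // no longer cached

    spark.stop()
  }
}
```

Setting `df1 = null` alone only drops your reference; the blocks stay in the cache until `unpersist` removes them (or the cache evicts them under pressure).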
