Spark - Scope, Data Frame, and memory management


Problem description

I am curious about how scope works with DataFrames in Spark. In the example below, I have a list of files, each of which is independently loaded into a DataFrame; some operations are performed, and then dfOutput is written to disk.

val files = getListOfFiles("outputs/emailsSplit")

for (file <- files){

   val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter","\t")          // Delimiter is tab
      .option("parserLib", "UNIVOCITY")  // Parser, which deals better with the email formatting
      .schema(customSchema)              // Schema of the table
      .load(file.toString)                        // Input file


   val dfOutput = df.[stuff happens]

   dfOutput.write
      .format("com.databricks.spark.csv")
      .mode("overwrite")
      .option("header", "true")
      .save("outputs/sentSplit/sentiment" + file.toString + ".csv")

}

  1. Is each DataFrame inside the for loop discarded when an iteration finishes, or do they all stay in memory?
  2. If they are not discarded, what is a better way to manage memory at this point?

Answer

DataFrame objects themselves are tiny. However, they can reference cached data on the Spark executors, and they can reference shuffle files on the executors. When a DataFrame is garbage collected, its cached data and shuffle files on the executors are deleted as well.

In your code there are no references to the DataFrames past the loop, so they are eligible for garbage collection. Garbage collection typically happens in response to memory pressure. If you are worried about shuffle files filling up the disk, it may make sense to trigger an explicit GC to make sure shuffle files are deleted for DataFrames that are no longer referenced.
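If shuffle files filling the disk ever becomes an issue, one option is to request a GC at the end of each iteration. This is a minimal sketch, assuming `files`, `customSchema`, and `sqlContext` are defined as in the question; `process` is a hypothetical stand-in for "[stuff happens]":

```scala
// Minimal sketch (assumption, not from the original answer): hint a driver-side
// GC at the end of each iteration so Spark's ContextCleaner can remove shuffle
// files belonging to DataFrames that have gone out of scope.
for (file <- files) {
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("delimiter", "\t")
    .schema(customSchema)
    .load(file.toString)

  val dfOutput = process(df)   // `process` stands in for "[stuff happens]"

  dfOutput.write
    .format("com.databricks.spark.csv")
    .mode("overwrite")
    .option("header", "true")
    .save("outputs/sentSplit/sentiment" + file.toString + ".csv")

  System.gc()   // explicit GC request; only worthwhile if shuffle files pile up on disk
}
```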

Depending on what you do with the DataFrame ([stuff happens]), it may be that no data is ever stored in memory. This is the default mode of operation in Spark. If you just want to read some data, transform it, and write it back out, it all happens row by row, without ever storing any of it in memory. (Caching only happens when you explicitly ask for it.)
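To make the contrast concrete, here is a hedged sketch of what explicit caching looks like; `df` is assumed to be one of the DataFrames from the loop above, and the extra action is purely illustrative:

```scala
// Caching is opt-in: without cache()/persist(), the read-transform-write
// pipeline streams rows through and never materializes the data in memory.
val cached = df.cache()   // mark for caching; nothing is stored yet

val n = cached.count()    // the first action materializes the cached blocks
// ... reuse `cached` in further transformations without re-reading the CSV ...

cached.unpersist()        // explicitly release the cached blocks when done
```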

With all that said, I suggest not worrying about memory management until you actually run into problems.
