Spark - Scope, Data Frame, and memory management


Question

I am curious about how scope works with DataFrames in Spark. In the example below, I have a list of files; each is independently loaded into a DataFrame, some operations are performed, and then dfOutput is written to disk.

val files = getListOfFiles("outputs/emailsSplit")

for (file <- files) {

   val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")         // Delimiter is tab
      .option("parserLib", "UNIVOCITY")  // Parser, which deals better with the email formatting
      .schema(customSchema)              // Schema of the table
      .load(file.toString)               // Input file

   val dfOutput = df.[stuff happens]

   dfOutput.write
      .format("com.databricks.spark.csv")
      .mode("overwrite")
      .option("header", "true")
      .save("outputs/sentSplit/sentiment" + file.toString + ".csv")
}

  1. Is each DataFrame inside the for loop discarded when a loop iteration is done, or do they stay in memory?
  2. If they are not discarded, what is a better way to manage memory at this point?

Answer

DataFrame objects themselves are tiny. However, they can reference data cached on the Spark executors, and they can reference shuffle files on the executors. When a DataFrame is garbage collected, the corresponding cache and shuffle files on the executors are also deleted.

In your code there are no references to the DataFrames past the loop, so they are eligible for garbage collection. Garbage collection typically happens in response to memory pressure. If you are worried about shuffle files filling up the disk, it may make sense to trigger an explicit GC to make sure shuffle files are deleted for DataFrames that are no longer referenced.
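
For example, a minimal sketch (reusing the files, sqlContext, and customSchema from the question) that hints the JVM to collect the previous iteration's DataFrames after each file, so that Spark's cleaner can remove their shuffle files on the executors. Note that System.gc() is only a hint to the JVM, not a guarantee:

for (file <- files) {

   val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")
      .option("parserLib", "UNIVOCITY")
      .schema(customSchema)
      .load(file.toString)

   val dfOutput = df // [stuff happens]

   dfOutput.write
      .format("com.databricks.spark.csv")
      .mode("overwrite")
      .option("header", "true")
      .save("outputs/sentSplit/sentiment" + file.toString + ".csv")

   // Hint the JVM to collect the now-unreachable DataFrames from this iteration,
   // which in turn lets Spark clean up their shuffle files on the executors.
   System.gc()
}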

Depending on what you do with the DataFrame ([stuff happens]), it may be that no data is ever stored in memory. That is the default mode of operation in Spark: if you just read some data, transform it, and write it back out, it all happens line by line, without ever storing any of it in memory. (Caching only happens when you explicitly ask for it.)
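
As a minimal sketch (the input path here is hypothetical), data is only held in memory when you explicitly cache a DataFrame, and you can release it yourself once you are done with it:

val df = sqlContext.read
   .format("com.databricks.spark.csv")
   .option("delimiter", "\t")
   .schema(customSchema)
   .load("outputs/emailsSplit/part-00000")  // hypothetical input file

df.cache()       // explicitly ask Spark to keep this DataFrame in memory
df.count()       // the first action materialises the cached blocks
// ... reuse df in further actions without re-reading the file ...
df.unpersist()   // release the cached blocks when they are no longer needed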

With all that said, I suggest not worrying about memory management until you actually run into problems.
