Why does a SPARK cached RDD spill to disk?


Question

I have the following code, where I repartition the filtered input data and persist it:

val df = sparkSession.sqlContext.read
      .parquet(path)
      .as[struct1]
      .filter(dateRange(_,lowerBound,upperBound))
      .repartition(nrInputPartitions)
      .persist()

df.count

I expect all data to be stored in memory, but instead I get the following in the Spark UI:

Storage

Size in Memory   424.2 GB 
Size on Disk     44.1 GB

Is it because some partitions didn't have enough memory, and Spark automatically switched to the MEMORY_AND_DISK storage level?

Answer

Is it because some partitions didn't have enough memory, and Spark automatically switched to the MEMORY_AND_DISK storage level?

Almost. It is because it is not an RDD but a Dataset, and the default storage level for Datasets is MEMORY_AND_DISK. Otherwise your suspicion is correct: if there is not enough memory, or cache eviction is required, the data goes to disk (although technically speaking it is not a spill).
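If you want the memory-only behaviour that an RDD's cache() defaults to, you can pass the storage level to persist explicitly. A minimal sketch, reusing the identifiers from the question (struct1, dateRange, lowerBound, upperBound and nrInputPartitions are assumed to be defined elsewhere):

import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY keeps partitions in memory only; partitions that do not fit
// are simply not cached and are recomputed when needed, instead of being
// written to disk.
val df = sparkSession.sqlContext.read
      .parquet(path)
      .as[struct1]
      .filter(dateRange(_, lowerBound, upperBound))
      .repartition(nrInputPartitions)
      .persist(StorageLevel.MEMORY_ONLY)

df.count

Note that with MEMORY_ONLY the "Size on Disk" entry in the Spark UI stays at zero, at the cost of recomputing any partitions that were evicted or never fit in memory.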

