Why does a SPARK cached RDD spill to disk?
Problem description
I have the following code, where I repartition the filtered input data and persist it:
val df = sparkSession.sqlContext.read
  .parquet(path)
  .as[struct1]                                   // typed Dataset[struct1]
  .filter(dateRange(_, lowerBound, upperBound))  // keep only rows in the date range
  .repartition(nrInputPartitions)
  .persist()                                     // no level given: the Dataset default applies

df.count  // action that materializes the cache
I expect all the data to be stored in memory, but instead I get the following in the Spark UI:
Storage
Size in Memory 424.2 GB
Size on Disk 44.1 GB
Is it because some partitions didn't have enough memory, and Spark automatically switched to the MEMORY_AND_DISK storage level?
Answer
Is it because some partitions didn't have enough memory, and Spark automatically switched to the MEMORY_AND_DISK storage level?
Almost. It is because it is not an RDD but a Dataset, and the default storage level for Datasets is MEMORY_AND_DISK. Otherwise your suspicion is correct: if there is not enough memory, or cache eviction is required, data goes to disk (although technically speaking it is not a spill).