spark cache only keeps a fraction of RDD


Question

When I explicitly call rdd.cache, I can see from the Spark console's Storage tab that only a fraction of the RDD is actually cached. My question is: where are the remaining parts? How does Spark decide which part to keep in the cache?

The same question applies to the initial raw data read in by sc.textFile(). I understand these RDDs are automatically cached, even though the Spark console's Storage tab does not display any information about their cache status. Do we know how much of that data is cached versus missing?

Answer

cache() is the same as persist(StorageLevel.MEMORY_ONLY), and your amount of data probably exceeds the available memory. Spark then evicts cached blocks on a least-recently-used (LRU) basis. With MEMORY_ONLY, partitions that were evicted or never fit are not stored anywhere; they are recomputed from lineage the next time they are needed.
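As a minimal sketch of that equivalence (the input path is made up, and sc.getRDDStorageInfo is a developer API used here only to inspect how much of the RDD actually made it into the cache):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///big/dataset")   // hypothetical input path
rdd.persist(StorageLevel.MEMORY_ONLY)          // identical to rdd.cache()
rdd.count()                                    // an action materializes the cache

// Partitions that did not fit in memory are simply not stored; with
// MEMORY_ONLY they are recomputed from lineage when accessed again.
sc.getRDDStorageInfo.foreach { info =>
  println(s"cached ${info.numCachedPartitions} of ${info.numPartitions} partitions, " +
    s"${info.memSize} bytes in memory")
}
```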

You can tweak the memory reserved for caching by setting configuration options. See the Spark documentation for details, and look out for: spark.driver.memory, spark.executor.memory, spark.storage.memoryFraction.
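A hedged sketch of setting those options programmatically; the values are arbitrary, and spark.storage.memoryFraction belongs to the legacy (pre-1.6) memory manager:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cache-tuning-example")          // hypothetical application name
  .set("spark.executor.memory", "4g")          // heap per executor (illustrative value)
  .set("spark.storage.memoryFraction", "0.6")  // fraction of heap for cached RDDs
                                               // (legacy setting, pre Spark 1.6)
val sc = new SparkContext(conf)
```

In practice, executor memory is often passed on the command line instead, e.g. spark-submit --executor-memory 4g.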

Not an expert, but I do not think that textFile() automatically caches anything; the Spark Quick Start explicitly caches a text file RDD: sc.textFile(logFile, 2).cache()
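The Quick Start pattern looks roughly like this; logFile and the filter predicates are placeholders, and without the explicit cache() call every action would re-read the file from disk:

```scala
val logFile = "hdfs:///logs/app.log"           // placeholder path
val logData = sc.textFile(logFile, 2).cache()  // caching must be requested explicitly
val errors = logData.filter(_.contains("ERROR")).count()  // first action fills the cache
val warns  = logData.filter(_.contains("WARN")).count()   // served from cache where it fits
```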

