How do I figure out the size of specific RDDs in the cache?
Question
I am frequently dealing with containers getting killed by YARN for exceeding memory limits. I suspect it has to do with caching/unpersisting RDDs/DataFrames in an inefficient manner.
What is the best way to debug this type of issue?
I have looked at the "Storage" tab in the Spark Web UI, but the "RDD Names" don't get any more descriptive than "MapPartitionsRDD" or "UnionRDD". How do I figure out which specific RDDs take up the most space in the cache?
In order to figure out the Out of Memory errors, I will need to figure out which RDDs are taking up the most space in the cache. I also want to be able to track when they get unpersisted.
Answer
- For `RDD`s you can set meaningful names using the `setName` method:

  ```scala
  val rdd: RDD[T] = ???
  rdd.setName("foo")
  ```
- For catalog-backed tables:

  ```scala
  val df: DataFrame = ???
  df.createOrReplaceTempView("foo")
  spark.catalog.cacheTable("foo")
  ```

  the name in the catalog will be reflected in both the UI and `SparkContext.getPersistentRDDs`.
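Once the cached entries have meaningful names, you can also rank them by size programmatically instead of eyeballing the Storage tab. A minimal sketch, not part of the original answer, assuming an active `SparkContext` named `sc` (note that `getRDDStorageInfo` is a developer API, so its exact shape may vary between Spark versions):

```scala
// List cached RDDs, largest in-memory footprint first.
// getRDDStorageInfo only reports RDDs that currently have cached partitions.
sc.getRDDStorageInfo
  .sortBy(info => -info.memSize)
  .foreach { info =>
    println(f"${info.id}%4d ${info.name}%-40s " +
            f"mem=${info.memSize}%,d B disk=${info.diskSize}%,d B " +
            f"cached ${info.numCachedPartitions}/${info.numPartitions} partitions")
  }
```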
I am not aware of any solution which works for standalone `Datasets`.
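The question also asks about tracking when RDDs get unpersisted. One option, not covered in the answer above, is to register a `SparkListener`. A hedged sketch, again assuming an active `SparkContext` named `sc`:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerUnpersistRDD}

// Log every unpersist event. SparkListenerUnpersistRDD only carries the
// RDD id, so correlate it with the ids/names listed above.
sc.addSparkListener(new SparkListener {
  override def onUnpersistRDD(event: SparkListenerUnpersistRDD): Unit = {
    println(s"RDD ${event.rddId} was unpersisted")
  }
})
```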