How do I figure out the size of specific RDDs in the cache?


Question

I am frequently dealing with containers getting killed by YARN for exceeding memory limits. I suspect it has to do with caching/unpersisting RDDs/DataFrames in an inefficient manner.

What is the best way to debug this type of issue?

I have looked at the "Storage" tab in the Spark Web UI, but the "RDD Names" don't get any more descriptive than "MapPartitionsRDD" or "UnionRDD". How do I figure out which specific RDDs take up the most space in the cache?

To get to the bottom of the out-of-memory errors, I need to figure out which RDDs are taking up the most space in the cache. I also want to be able to track when they get unpersisted.

Answer

  • For RDDs you can set meaningful names using the setName method:

      val rdd: RDD[T] = ???
      // the name appears in the Storage tab once the RDD is cached
      rdd.setName("foo")
      

  • For catalog-backed tables:

      val df: DataFrame = ???
      df.createOrReplaceTempView("foo")
      // the cached data is listed under the table name in the Storage tab
      spark.catalog.cacheTable("foo")
      

    The name in the catalog will be reflected in both the UI and SparkContext.getPersistentRDDs (see the sketch below).

  • I am not aware of any solution which works for standalone Datasets.
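Beyond reading the Storage tab, the same information can be pulled programmatically, which also answers the "which RDDs take up the most space" part directly. A minimal sketch, assuming a live SparkContext in scope as sc (note that getRDDStorageInfo is marked as a developer API, so its exact shape may vary across Spark versions):

      import org.apache.spark.scheduler.{SparkListener, SparkListenerUnpersistRDD}

      // Rank cached RDDs by in-memory footprint; names set via setName
      // (or catalog table names) show up in info.name.
      sc.getRDDStorageInfo
        .sortBy(info => -info.memSize)
        .foreach { info =>
          println(s"${info.name}: mem=${info.memSize} B, disk=${info.diskSize} B, " +
            s"${info.numCachedPartitions}/${info.numPartitions} partitions cached")
        }

      // Everything currently marked persistent, keyed by RDD id.
      sc.getPersistentRDDs.foreach { case (id, rdd) =>
        println(s"id=$id name=${rdd.name} level=${rdd.getStorageLevel}")
      }

      // To track *when* RDDs get unpersisted, listen for the unpersist event.
      sc.addSparkListener(new SparkListener {
        override def onUnpersistRDD(event: SparkListenerUnpersistRDD): Unit =
          println(s"RDD ${event.rddId} was unpersisted")
      })

Taking a snapshot of getRDDStorageInfo before and after each cache/unpersist call, together with the listener above, gives a reasonable picture of what enters and leaves the cache over the life of the job.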

