Yarn Heap usage growing over time


Problem description



We run a Spark Streaming job on AWS EMR. The job runs stably for anywhere between 10 and 14 hours, and then crashes with no discernible errors in stderr, stdout, or the CloudWatch logs. After this crash, any attempt to restart the job immediately fails with 'Cannot allocate memory' (errno=12) (full message).

Investigation with both CloudWatch metrics and Ganglia shows that driver.jvm.heap.used is steadily growing over time.

Both of these observations led me to believe that some long-running component of Spark (i.e. above the job level) was failing to free memory correctly. This is supported by the fact that restarting the hadoop-yarn-resourcemanager (as per here) causes heap usage to drop back to "fresh cluster" levels.

If my assumption there is indeed correct - what would cause Yarn to keep consuming more and more memory? (If not - how could I falsify that?)

  • I see from here that spark.streaming.unpersist defaults to true (although I've tried adding a manual rdd.unpersist() at the end of my job anyway, just to check whether that has any effect - it hasn't been running long enough to tell definitively yet; see the sketch after this list)
  • Here, the comment on spark.yarn.am.extraJavaOptions suggests that, when running in yarn-client mode (which we are), spark.yarn.am.memory sets the maximum Yarn Application Master heap memory usage. This value is not overridden in our job (so it should be at the default of 512m), but both CloudWatch and Ganglia clearly show driver heap usage in the gigabytes.
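
For concreteness, here is a minimal sketch of the kind of manual unpersist mentioned above; the socket source, batch interval, and names are placeholders, not our actual pipeline:

```scala
// Sketch only: cache an RDD inside a batch and release it explicitly once the
// batch's actions are done. Source, interval and names are illustrative.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ManualUnpersistSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("manual-unpersist-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)

    lines.foreachRDD { rdd =>
      val words = rdd.flatMap(_.split("\\s+")).persist()
      // ...per-batch actions that reuse `words` would go here...
      println(s"words in batch: ${words.count()}")
      words.unpersist() // free the cached blocks at the end of the batch
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that spark.streaming.unpersist=true already covers RDDs generated and persisted by the DStream machinery itself; an explicit unpersist() mainly matters for RDDs you persist yourself inside foreachRDD.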

Solution

It turns out that the default SparkUI values here were much larger than our system could handle. After setting them down to 1/20th of the default values, the system has been running stably for 24 hours with no increase in heap usage over that time.

For clarity, the values that were edited were:

* spark.ui.retainedJobs=50
* spark.ui.retainedStages=50
* spark.ui.retainedTasks=500
* spark.worker.ui.retainedExecutors=50
* spark.worker.ui.retainedDrivers=50
* spark.sql.ui.retainedExecutions=50
* spark.streaming.ui.retainedBatches=50
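
One way to apply them is directly on the SparkConf before the context is created; this is only a sketch, and the same keys can equally be passed as --conf flags to spark-submit or set in spark-defaults.conf:

```scala
import org.apache.spark.SparkConf

// Sketch: the reduced Spark UI retention settings applied programmatically.
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "50")
  .set("spark.ui.retainedStages", "50")
  .set("spark.ui.retainedTasks", "500")
  .set("spark.worker.ui.retainedExecutors", "50")
  .set("spark.worker.ui.retainedDrivers", "50")
  .set("spark.sql.ui.retainedExecutions", "50")
  .set("spark.streaming.ui.retainedBatches", "50")
```

These settings bound how much completed job, stage, and task metadata the driver retains for the UI, which appears to be what was accumulating in the driver heap.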
