SPARK: YARN kills containers for exceeding memory limits


Problem Description

We're currently encountering an issue where Spark jobs are seeing a number of containers being killed for exceeding memory limits when running on YARN.

16/11/18 17:58:52 WARN TaskSetManager: Lost task 53.0 in stage 49.0 (TID 32715, XXXXXXXXXX): 
  ExecutorLostFailure (executor 23 exited caused by one of the running tasks) 
  Reason: Container killed by YARN for exceeding memory limits. 12.4 GB of 12 GB physical memory used. 
    Consider boosting spark.yarn.executor.memoryOverhead.

The following arguments are being passed via spark-submit:

--executor-memory=6G
--driver-memory=4G
--conf "spark.yarn.executor.memoryOverhead=6G"`

I am using Spark 2.0.1.

We have increased the memoryOverhead to this value after reading several posts about YARN killing containers (e.g. How to avoid Spark executor from getting lost and yarn container killing it due to memory limit?).

Given my parameters and the log message, it does seem that "Yarn kills executors when its memory usage is larger than (executor-memory + executor.memoryOverhead)"; in our case, 6 GB of executor memory plus 6 GB of overhead matches the 12 GB limit reported in the log.

It is not practical to keep increasing this overhead in the hope that we eventually find a value at which these errors stop occurring. We are seeing this issue on several different jobs. I would appreciate any suggestions on which parameters to change, what to check, and where to start debugging. I can provide further configuration details if needed.

Recommended Answer

You can reduce the memory usage with the following configurations in spark-defaults.conf:

spark.default.parallelism
spark.sql.shuffle.partitions
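For example, a minimal spark-defaults.conf sketch; the values below are purely illustrative (they are not from the original answer) and need to be tuned to your data volume and cluster. The idea is that more partitions mean smaller partitions, and therefore less memory per task:

spark.default.parallelism      2048
spark.sql.shuffle.partitions   2048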

There is also a difference when you use more than 2000 partitions for spark.sql.shuffle.partitions. You can see it in the Spark code on GitHub:

private[spark] object MapStatus {

  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    // With more than 2000 shuffle partitions, Spark switches to a highly
    // compressed (approximate) map status to keep its size small.
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
}

I recommend trying more than 2000 partitions as a test. It can sometimes be faster when you are working with very large datasets. According to this, your tasks can be as short as 200 ms. The right configuration is not easy to find, but depending on your workload it can make a difference of hours.
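As a sketch of such a test (the object name, application name, and the value 2500 are illustrative assumptions, not part of the original answer), the shuffle-partition setting can also be changed per session in Scala:

import org.apache.spark.sql.SparkSession

object ShufflePartitionTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-partition-test") // illustrative application name
      .getOrCreate()

    // More than 2000 shuffle partitions switches Spark to HighlyCompressedMapStatus
    // (see the MapStatus snippet above); the exact value here is an assumption to tune.
    spark.conf.set("spark.sql.shuffle.partitions", "2500")

    // ... run the job/query that was previously killed by YARN ...

    spark.stop()
  }
}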
