SPARK: YARN kills containers for exceeding memory limits
Question
We're currently encountering an issue where Spark jobs are seeing a number of containers being killed for exceeding memory limits when running on YARN.
16/11/18 17:58:52 WARN TaskSetManager: Lost task 53.0 in stage 49.0 (TID 32715, XXXXXXXXXX):
ExecutorLostFailure (executor 23 exited caused by one of the running tasks)
Reason: Container killed by YARN for exceeding memory limits. 12.4 GB of 12 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
The following arguments are being passed via spark-submit:
--executor-memory=6G
--driver-memory=4G
--conf "spark.yarn.executor.memoryOverhead=6G"
I am using Spark 2.0.1.
We have increased the memoryOverhead to this value after reading several posts about YARN killing containers (e.g. How to avoid Spark executor from getting lost and yarn container killing it due to memory limit?).
Given my parameters and the log message it does seem that "Yarn kills executors when its memory usage is larger than (executor-memory + executor.memoryOverhead)".
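The arithmetic behind that log line can be sketched in a few lines (the `parse_gb` helper below is purely illustrative, not a Spark API; it just shows that the 12 GB container limit in the log is the sum of the two settings passed to spark-submit):

```python
def parse_gb(value: str) -> float:
    """Parse a size string like '6G' or '512M' into gigabytes (illustrative helper)."""
    units = {"G": 1.0, "M": 1.0 / 1024}
    return float(value[:-1]) * units[value[-1].upper()]

executor_memory = parse_gb("6G")   # --executor-memory
memory_overhead = parse_gb("6G")   # spark.yarn.executor.memoryOverhead
container_limit = executor_memory + memory_overhead

# Matches the "12 GB physical memory" limit YARN enforced in the log above
print(f"YARN container limit: {container_limit:g} GB")
```

So any allocation beyond executor-memory + memoryOverhead (off-heap buffers, native libraries, etc.) pushes the container over the limit and gets it killed.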
It is not practical to keep increasing this overhead in the hope that we eventually find a value at which these errors stop occurring. We are seeing this issue on several different jobs. I would appreciate any suggestions on parameters I should change, things I should check, and where I should start debugging. I am able to provide further config options as needed.
Answer
You can reduce the memory usage with the following configurations in spark-defaults.conf:
spark.default.parallelism
spark.sql.shuffle.partitions
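For example, in spark-defaults.conf (the values below are placeholders to illustrate the syntax, not recommendations for any particular cluster):

```properties
# Default partition count for RDD shuffle operations (join, reduceByKey, ...)
spark.default.parallelism      1000
# Partition count for Spark SQL / DataFrame shuffles
spark.sql.shuffle.partitions   1000
```

More partitions mean smaller shuffle blocks per task, which lowers per-executor memory pressure.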
And there is a difference when you use more than 2000 partitions for spark.sql.shuffle.partitions. You can see it in the Spark code on GitHub:
private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      // Above 2000 partitions, Spark switches to a more compact map-status encoding
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
}
I recommend trying more than 2000 partitions as a test. It can sometimes be faster when you use very large datasets. According to this, your tasks can be as short as 200 ms. The correct configuration is not easy to find, but depending on your workload it can make a difference of hours.
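One way to pick a starting partition count is to size partitions by shuffle volume. This sketch assumes a rule-of-thumb target of roughly 128 MB of shuffle data per partition; both the helper and that target are assumptions for illustration, not anything prescribed by Spark:

```python
def suggest_shuffle_partitions(shuffle_bytes: int,
                               target_bytes: int = 128 * 1024 * 1024) -> int:
    """Suggest a spark.sql.shuffle.partitions value from total shuffle size.

    target_bytes (~128 MB per partition) is a rule-of-thumb assumption.
    """
    return max(1, -(-shuffle_bytes // target_bytes))  # ceiling division

# e.g. a stage that shuffles roughly 500 GB
partitions = suggest_shuffle_partitions(500 * 1024**3)
print(partitions)  # 4000, above the 2000 threshold where HighlyCompressedMapStatus kicks in
```

You can read the actual shuffle size per stage from the Spark UI and tune from there.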