Why increase spark.yarn.executor.memoryOverhead?
Question

I am trying to join two large Spark dataframes and keep running into this error:
Container killed by YARN for exceeding memory limits. 24 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
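For what it's worth, the numbers in that error line up with how Spark on YARN sizes containers: the container request is executor memory plus the overhead, where the overhead defaults to 10% of executor memory with a 384 MB floor. A minimal sketch of that arithmetic (my own illustration of the documented defaults, not Spark's actual code):

```python
# Sketch: how a Spark-on-YARN executor container request is sized.
# Default overhead = max(384 MB, 10% of executor memory), per the Spark docs.

def container_request_mb(executor_memory_mb, overhead_mb=None):
    if overhead_mb is None:
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead_mb

# With 20 GB of executor memory and the default overhead:
print(container_request_mb(20 * 1024))  # 20480 + 2048 = 22528 MB, i.e. the ~22 GB limit
```

So a "24 GB of 22 GB physical memory used" kill means the process blew well past the 2 GB of headroom the default overhead provides, which is presumably why the message suggests raising it.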
This seems like a common issue among Spark users, but I can't seem to find any solid description of what spark.yarn.executor.memoryOverhead is. In some cases it sounds like it's a kind of memory buffer before YARN kills the container (e.g. 10 GB was requested, but YARN won't kill the container until it uses 10.2 GB). In other cases it sounds like it's used for some kind of data accounting tasks that are completely separate from the analysis that I want to perform. My questions are:
- What is spark.yarn.executor.memoryOverhead being used for?
- What is the benefit of increasing this kind of memory instead of executor memory (or the number of executors)?
- In general, are there steps I can take to reduce my spark.yarn.executor.memoryOverhead usage (e.g. particular data structures, limiting the width of the dataframes, using fewer executors with more memory, etc.)?
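For reference, when I do bump the overhead I set it roughly like the following spark-submit invocation (values and the script name are illustrative, not my actual job; in the Spark version I'm on, the overhead is specified in MB):

```shell
# Illustrative spark-submit flags (hypothetical values and script name).
spark-submit \
  --master yarn \
  --executor-memory 20g \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  my_join_job.py
```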