How does Spark running on YARN account for Python memory usage?


Question

After reading through the documentation I do not understand how Spark running on YARN accounts for Python memory consumption.

Does it count towards spark.executor.memory, spark.executor.memoryOverhead, or somewhere else?

In particular I have a PySpark application with spark.executor.memory=25G and spark.executor.cores=4, and I encounter frequent "Container killed by YARN for exceeding memory limits" errors when running a map on an RDD. It operates on a fairly large number of complex Python objects, so it is expected to take up a non-trivial amount of memory, but not 25GB. How should I configure the different memory variables for use with heavy Python code?

Answer

I'd try increasing spark.python.worker.memory from its default (512m), given the heavy Python code; this property's value does not count towards spark.executor.memory.

Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks. link
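
A minimal sketch of how this property might be set when building a PySpark session (the 2g value is only an illustrative assumption, not a tuned recommendation):

from pyspark.sql import SparkSession

# Raise the per-Python-worker aggregation memory from the 512m default.
# 2g is an arbitrary example value; size it to your workload.
spark = (
    SparkSession.builder
    .appName("heavy-python-job")
    .config("spark.python.worker.memory", "2g")
    .getOrCreate()
)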

Calculation of ExecutorMemoryOverhead in Spark:

MEMORY_OVERHEAD_FRACTION = 0.10
MEMORY_OVERHEAD_MINIMUM = 384
val executorMemoryOverhead =
  max(MEMORY_OVERHEAD_FRACTION * ${spark.executor.memory}, MEMORY_OVERHEAD_MINIMUM)
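
Plugging the asker's spark.executor.memory=25G into that formula gives a rough idea of the container request (a back-of-the-envelope check, values in MB):

# Back-of-the-envelope check for spark.executor.memory = 25G (values in MB)
MEMORY_OVERHEAD_FRACTION = 0.10
MEMORY_OVERHEAD_MINIMUM = 384

executor_memory = 25 * 1024                           # 25600 MB
executor_memory_overhead = max(
    MEMORY_OVERHEAD_FRACTION * executor_memory,       # 2560 MB
    MEMORY_OVERHEAD_MINIMUM,
)
container_limit = executor_memory + executor_memory_overhead
print(container_limit)                                # 28160 MB, i.e. ~27.5 GB per container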

The property is spark.{yarn|mesos}.executor.memoryOverhead for YARN and Mesos.

YARN kills processes that take more memory than they requested, where the request is the sum of executorMemoryOverhead and executorMemory.

In the image from the original answer, the Python processes in the worker use spark.python.worker.memory, while spark.yarn.executor.memoryOverhead + spark.executor.memory goes to the executor JVM itself.
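
Since those Python worker processes live outside the JVM heap but still inside the YARN container, one common adjustment is to give the container more headroom above spark.executor.memory. A hedged sketch, where both values are illustrative assumptions rather than tuned figures:

from pyspark import SparkConf, SparkContext

# Trade a little JVM heap for off-heap headroom so the Python workers
# fit inside the YARN container alongside the executor JVM.
conf = (
    SparkConf()
    .set("spark.executor.memory", "20g")                # example: smaller JVM heap
    .set("spark.yarn.executor.memoryOverhead", "4096")  # example: 4 GB of headroom, in MB
)
sc = SparkContext(conf=conf)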
