"spark.yarn.executor.memoryOverhead"之间的区别和"spark.memory.offHeap.size"; [英] Difference between "spark.yarn.executor.memoryOverhead" and "spark.memory.offHeap.size"


Question


I am running Spark on YARN. I don't understand the difference between the following settings: spark.yarn.executor.memoryOverhead and spark.memory.offHeap.size. Both seem to be settings for allocating off-heap memory to the Spark executor. Which one should I use? Also, what is the recommended setting for executor off-heap memory?

Thanks a lot!

Answer


spark.executor.memoryOverhead is used by resource managers such as YARN, whereas spark.memory.offHeap.size is used by Spark core (the memory manager). The relationship between them differs by Spark version.

Spark 2.4.5 and earlier:


spark.executor.memoryOverhead should include spark.memory.offHeap.size. This means that if you specify offHeap.size, you need to add this portion to memoryOverhead for YARN manually. As you can see from the code below, from YarnAllocator.scala, when Spark requests resources from YARN it does not account for offHeap.size:

// YarnAllocator.scala (Spark 2.4.x): the container request omits off-heap memory
private[yarn] val resource = Resource.newInstance(
    executorMemory + memoryOverhead + pysparkWorkerMemory,
    executorCores)
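
For illustration, here is a minimal sketch of what this means in practice on Spark 2.4.5 and earlier; the application name and all memory sizes are my own assumptions, not recommendations:

import org.apache.spark.sql.SparkSession

// Sketch for Spark 2.4.5 and earlier; sizes are illustrative assumptions.
// Since the YARN request above omits off-heap memory, memoryOverhead must be
// raised by hand: the default overhead (max(384m, 0.10 * 4g) ≈ 410m) plus the
// 2g of off-heap gives roughly 2458m.
val spark = SparkSession.builder()
  .appName("offheap-sizing-sketch")
  .config("spark.executor.memory", "4g")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  .config("spark.executor.memoryOverhead", "2458m")
  .getOrCreate()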

However, this behavior changed in Spark 3.0:


spark.executor.memoryOverhead no longer includes spark.memory.offHeap.size. YARN will include offHeap.size for you when requesting resources. From the new documentation:


Note: Additional memory includes PySpark executor memory (when spark.executor.pyspark.memory is not configured) and memory used by other non-executor processes running in the same container. The maximum memory size of container to running executor is determined by the sum of spark.executor.memoryOverhead, spark.executor.memory, spark.memory.offHeap.size and spark.executor.pyspark.memory.


And from the code you can also tell:

// YarnAllocator.scala (Spark 3.0+): the container request now adds off-heap memory
private[yarn] val resource: Resource = {
    val resource = Resource.newInstance(
      executorMemory + executorOffHeapMemory + memoryOverhead + pysparkWorkerMemory, executorCores)
    ResourceRequestHelper.setResourceRequests(executorResourceRequests, resource)
    logDebug(s"Created resource capability: $resource")
    resource
  }
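
To make the new accounting concrete, here is a small arithmetic sketch; all sizes are assumptions in MiB, and the variable names simply mirror the code above:

// Illustrative arithmetic for Spark 3.0+; all sizes are assumptions, in MiB.
val executorMemory = 4096        // spark.executor.memory
val executorOffHeapMemory = 2048 // spark.memory.offHeap.size
val memoryOverhead = 410         // default: max(384, 0.10 * executorMemory)
val pysparkWorkerMemory = 0      // spark.executor.pyspark.memory (not configured)

// The container request is now the sum of all four components, so there is
// no need to fold off-heap into memoryOverhead yourself.
val containerRequest = executorMemory + executorOffHeapMemory + memoryOverhead + pysparkWorkerMemory
println(s"Container request: $containerRequest MiB") // 6554 MiB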


For more details of this change, you can refer to this Pull Request.


As for your second question, what is the recommended setting for executor off-heap memory? It depends on your application, and you need to do some testing. I found this page helpful in explaining it further:


Off-heap memory is a great way to reduce GC pauses because it is outside the GC's scope. However, it brings an overhead of serialization and deserialization, and the latter means that off-heap data can sometimes be put onto heap memory and hence be exposed to GC. Also, the new data format brought by Project Tungsten (an array of bytes) helps reduce GC overhead. For these two reasons, the use of off-heap memory in Apache Spark applications should be carefully planned and, especially, tested.


BTW, spark.yarn.executor.memoryOverhead is deprecated and has been renamed to spark.executor.memoryOverhead, which applies to both YARN and Kubernetes.
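
For example, a quick sketch of the portable spelling (the 1g value is just a placeholder):

import org.apache.spark.SparkConf

// Prefer the resource-manager-agnostic key; works on both YARN and Kubernetes.
val conf = new SparkConf()
  .set("spark.executor.memoryOverhead", "1g")
// conf.set("spark.yarn.executor.memoryOverhead", "1g") // deprecated, YARN-only spelling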
