Spark: executor memory exceeds physical limit


Problem description

My input dataset is about 150G. I am setting

--conf spark.cores.max=100 
--conf spark.executor.instances=20 
--conf spark.executor.memory=8G 
--conf spark.executor.cores=5 
--conf spark.driver.memory=4G

but since data is not evenly distributed across executors, I kept getting

Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used

Here are my questions:

1. Did I not set up enough memory in the first place? I think 20 * 8G > 150G, but it's hard to achieve a perfect distribution, so some executors will suffer.
2. I am thinking about repartitioning the input DataFrame, so how can I determine how many partitions to set? Is higher better?
3. The error says "9 GB physical memory used", but I only set 8G for executor memory, so where does the extra 1G come from?

Thanks!

Recommended answer

The 9GB is composed of the 8GB executor memory which you set as a parameter, plus spark.yarn.executor.memoryOverhead, which defaults to 0.1 of the executor memory. So the total memory of the container is spark.executor.memory + (0.1 * spark.executor.memory), which is 8GB + (0.1 * 8GB) ≈ 9GB.
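As a minimal sketch of that arithmetic (the 0.1 overhead fraction and the 384 MB floor are Spark's defaults for the YARN executor memory overhead; the 8G value comes from the configuration in the question), in Scala:

// Container size requested from YARN = executor memory + memory overhead.
val executorMemoryGiB = 8.0
val overheadGiB = math.max(0.10 * executorMemoryGiB, 384.0 / 1024.0)   // default: 10%, with a 384 MiB floor
val containerGiB = executorMemoryGiB + overheadGiB                     // ≈ 8.8 GiB
println(f"Requested container size: $containerGiB%.1f GiB")            // YARN rounds this up, hence the 9 GB in the error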

You could run the entire process using a single executor, but this would take ages. To understand this you need to know the notion of partitions and tasks. The number of partitions is defined by your input and the actions. For example, if you read a 150gb csv from hdfs and your hdfs blocksize is 128mb, you will end up with 150 * 1024 / 128 = 1200 partitions, which maps directly to 1200 tasks in the Spark UI.
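As a rough illustration (the HDFS path, header option, and app name are made up; 1200 matches the 150 * 1024 / 128 estimate above), a Scala sketch of reading and repartitioning could look like:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioning-sketch").getOrCreate()

// Hypothetical input path; the partition count after the read is driven by
// file size / HDFS block size, as in the 150 * 1024 / 128 = 1200 estimate.
val df = spark.read.option("header", "true").csv("hdfs:///data/input_150gb.csv")
println(s"Input partitions: ${df.rdd.getNumPartitions}")

// If partitions are too large or skewed, repartition before the heavy stage;
// aim for partitions that fit comfortably inside one executor's memory.
val repartitioned = df.repartition(1200)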

Every single task will be picked up by an executor. You never need to hold all 150gb in memory at once. For example, when you have a single executor, you obviously won't benefit from the parallel capabilities of Spark, but it will just start at the first task, process the data, save it back to the dfs, and start working on the next task.

What you should check:

  • How big are the input partitions? Is the input file splittable at all? If a single executor has to load a massive amount of data, it will run out of memory for sure.
  • What kind of actions are you performing? For example, if you do a join on a key with very low cardinality, you end up with massive partitions because all the rows with a specific value end up in the same partition (see the sketch after this list).
  • Are you performing very expensive or inefficient actions? Any Cartesian product, etc.
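
To check the second point, you can look at how many rows each join key carries before running the join. A minimal sketch (df is the DataFrame from the previous snippet, and the column name joinKey is hypothetical):

import org.apache.spark.sql.functions.{count, desc}

// If a handful of keys account for most of the rows, the join will pile them
// into a few huge partitions -- exactly the skew described above.
df.groupBy("joinKey")
  .agg(count("*").as("rows"))
  .orderBy(desc("rows"))
  .show(20, truncate = false)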

Hope this helps. Happy sparking!

