Why isn't a very big Spark stage using all available executors?

Question

I am running a Spark job with some very big stages (e.g. >20k tasks), using 1k to 2k executors.

In some cases, a stage will appear to run unstably: many available executors become idle over time, despite the job still being in the middle of a stage with many unfinished tasks. From the user's perspective, tasks appear to be finishing, but executors that have finished a given task do not get a new task assigned to them. As a result, the stage takes longer than it should, and a lot of executor CPU-hours are wasted idling. This seems to mostly (only?) happen during input stages, where data is being read from HDFS.
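
For context, a minimal sketch of the kind of input job described (the path and the follow-up work are illustrative assumptions, not from the original post). In an HDFS read stage, the number of tasks is driven by the number of input splits, not by the executor count, which is how a single stage can have >20k tasks while only 1k-2k executors are available:

import org.apache.spark.sql.SparkSession

object InputStageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("input-stage-sketch").getOrCreate()
    val sc = spark.sparkContext

    // One task per HDFS input split, so a large dataset can yield tens of
    // thousands of tasks in the read stage (e.g. the 28504 tasks of Stage 0).
    val lines = sc.textFile("hdfs:///data/big-dataset")   // hypothetical path
    println(s"input partitions = ${lines.getNumPartitions}")

    // A wide operation after the read produces a second stage (like Stage 1).
    lines.map(l => (l.length % 100, 1L)).reduceByKey(_ + _).count()

    spark.stop()
  }
}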

Example Spark stderr log during an unstable period -- notice that the number of running tasks decreases over time until it almost reaches zero, then suddenly jumps back up to >1k running tasks:

[Stage 0:==============================>                 (17979 + 1070) / 28504]
[Stage 0:==============================>                 (18042 + 1019) / 28504]
[Stage 0:===============================>                 (18140 + 921) / 28504]
[Stage 0:===============================>                 (18222 + 842) / 28504]
[Stage 0:===============================>                 (18263 + 803) / 28504]
[Stage 0:===============================>                 (18282 + 786) / 28504]
[Stage 0:===============================>                 (18320 + 751) / 28504]
[Stage 0:===============================>                 (18566 + 508) / 28504]
[Stage 0:================================>                (18791 + 284) / 28504]
[Stage 0:================================>                (18897 + 176) / 28504]
[Stage 0:================================>                (18940 + 134) / 28504]
[Stage 0:================================>                (18972 + 107) / 28504]
[Stage 0:=================================>                (19035 + 47) / 28504]
[Stage 0:=================================>                (19067 + 17) / 28504]
[Stage 0:================================>               (19075 + 1070) / 28504]
[Stage 0:================================>               (19107 + 1039) / 28504]
[Stage 0:================================>                (19165 + 982) / 28504]
[Stage 0:=================================>               (19212 + 937) / 28504]
[Stage 0:=================================>               (19251 + 899) / 28504]
[Stage 0:=================================>               (19355 + 831) / 28504]
[Stage 0:=================================>               (19481 + 708) / 28504]

This is what the stderr looks like when a stage is running stably -- the number of running tasks remains roughly constant, because new tasks are assigned to executors as they finish their previous tasks:

[Stage 1:===================>                            (11599 + 2043) / 28504]
[Stage 1:===================>                            (11620 + 2042) / 28504]
[Stage 1:===================>                            (11656 + 2044) / 28504]
[Stage 1:===================>                            (11692 + 2045) / 28504]
[Stage 1:===================>                            (11714 + 2045) / 28504]
[Stage 1:===================>                            (11741 + 2047) / 28504]
[Stage 1:===================>                            (11771 + 2047) / 28504]
[Stage 1:===================>                            (11818 + 2047) / 28504]

Under what circumstances would this happen, and how can I avoid this behavior?

NB: I am using dynamic allocation, but I'm pretty sure it is unrelated to this problem -- e.g., during an unstable period, the Spark Application Master UI shows the expected number of executors as "Active", but they are not running any "Active Tasks".
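
For reference, a minimal sketch of what the dynamic-allocation setup and the executor check described above might look like (the config values and the programmatic check are illustrative assumptions; the post only says dynamic allocation is enabled and that the UI shows active executors without active tasks):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object DynamicAllocationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")         // usually required with dynamic allocation
      .set("spark.dynamicAllocation.minExecutors", "1000")  // illustrative bounds for a 1k-2k executor job
      .set("spark.dynamicAllocation.maxExecutors", "2000")

    val spark = SparkSession.builder().config(conf).appName("dyn-alloc-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Rough programmatic equivalent of the "Active" / "Active Tasks" check in the UI:
    // count registered executors that are currently running zero tasks.
    val infos = sc.statusTracker.getExecutorInfos
    val idle  = infos.count(_.numRunningTasks() == 0)
    println(s"registered executors = ${infos.length}, idle = $idle")

    spark.stop()
  }
}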

Answer

I've seen behavior like this from Spark when the amount of time taken per task is very low. For some reason, the scheduler seems to assume that the job will complete faster without the extra distribution overhead, since each task completes so quickly.

A few things to try:

  • Try .coalesce() to reduce the number of partitions, so that each partition takes longer to run (granted, this could introduce a shuffle step and may increase overall job time; you'll have to experiment).
  • Tweak the spark.locality.wait* settings. If each task takes less than the default wait time of 3s, then perhaps the scheduler is just trying to keep the existing slots full and never gets a chance to allocate more slots (a sketch of both suggestions follows below).
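
A minimal sketch of the two suggestions above; the partition target, the lowered locality wait, and the input path are illustrative assumptions to experiment from, since the answer only says to reduce partitions and tweak the spark.locality.wait* settings:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object StageTuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // Suggestion 2: with very short tasks, lowering the locality wait lets the
      // scheduler hand work to any idle executor instead of holding tasks back
      // for a data-local slot (the documented default is 3s).
      .set("spark.locality.wait", "0s")

    val spark = SparkSession.builder().config(conf).appName("stage-tuning-sketch").getOrCreate()
    val sc = spark.sparkContext

    val input = sc.textFile("hdfs:///data/big-dataset")    // hypothetical path

    // Suggestion 1: fewer, larger partitions so each task runs longer.
    // coalesce(n) merges partitions without a full shuffle; repartition(n) forces one.
    val coalesced = input.coalesce(4000)                    // illustrative target, tune by experiment
    println(s"partitions after coalesce = ${coalesced.getNumPartitions}")

    // Some per-partition work just to run the stage.
    println(s"total characters = ${coalesced.map(_.length.toLong).reduce(_ + _)}")

    spark.stop()
  }
}

Note that setting spark.locality.wait to 0 trades data locality for scheduler throughput, so it's worth checking whether HDFS read performance suffers before adopting it.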

I've yet to track down exactly what causes this issue, so these are only speculations and hunches based on observations of my own (much smaller) cluster.
