Spark tasks stuck at RUNNING

Problem Description

I'm trying to run a Spark ML pipeline (load some data from JDBC, run some transformers, train a model) on my Yarn cluster, but each time I run it, a few of my executors (sometimes one, sometimes 3 or 4) get stuck running their first task set (that'd be 3 tasks for each of their 3 cores), while the rest run normally, checking off 3 at a time.

In the UI, you'd see something like this: [Spark UI screenshot not preserved]

Some things I have observed so far:

  • When I set up my executors to use 1 core each with spark.executor.cores (i.e. run 1 task at a time), the issue does not occur (see the config sketch after this list);
  • The stuck executors always seem to be the ones that had to have some partitions shuffled to them in order to run the task;
  • The stuck tasks would ultimately get executed successfully, speculatively, by another instance;
  • Occasionally, a single task would get stuck in an otherwise normal executor, while its other 2 cores kept working fine;
  • The stuck executor instances look like everything is normal: CPU is at ~100%, plenty of memory to spare, the JVM processes are alive, neither Spark nor Yarn logs anything out of the ordinary, and they can still receive instructions from the driver, such as "drop this task, someone else speculatively executed it already"; for some reason, though, they don't drop it;
  • Those executors never get killed off by the driver, so I imagine they keep sending their heartbeats just fine.
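For reference, here's roughly how the job is set up. The 3 cores per executor come straight from the observations above; the builder-style calls and the app name are a simplified sketch, and spark.speculation=true is inferred from the fact that speculative copies run at all:

    import org.apache.spark.sql.SparkSession

    // Illustrative setup matching the symptoms described above.
    val spark = SparkSession.builder()
      .appName("ml-pipeline") // hypothetical app name
      // 3 task threads run concurrently inside each executor JVM;
      // setting this to "1" makes the problem disappear, per the first observation.
      .config("spark.executor.cores", "3")
      // lets a stuck task be re-attempted on another executor, which is why
      // the stuck tasks eventually succeed speculatively elsewhere.
      .config("spark.speculation", "true")
      .getOrCreate()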

Any ideas as to what may be causing this or what I should try?

Recommended Answer

TLDR: Make sure your code is thread-safe and free of race conditions before you blame Spark.

Figured it out. For posterity: I was using a thread-unsafe data structure (a mutable HashMap). Since all the task threads within an executor share its JVM, this was resulting in data races that were locking up the separate threads/tasks.
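In sketch form, the offending pattern is roughly this; FeatureCache and expensiveComputation are hypothetical names, not the actual code:

    import scala.collection.mutable

    // HYPOTHETICAL sketch of the bug: a mutable HashMap held in a shared
    // object (or captured by a closure), hit by every task thread in the
    // executor JVM.
    object FeatureCache {
      val cache = mutable.HashMap.empty[String, Double] // NOT thread-safe

      def expensiveComputation(key: String): Double =
        key.length.toDouble // stand-in for the real work

      // Called from inside a transformer/UDF. With spark.executor.cores > 1,
      // several task threads mutate the map concurrently; an unsynchronized
      // hash map can corrupt its internal state under concurrent writes and
      // leave a thread spinning at ~100% CPU, which matches the "stuck but
      // alive" behaviour in the question.
      def lookup(key: String): Double =
        cache.getOrElseUpdate(key, expensiveComputation(key))
    }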

The upshot: when you have spark.executor.cores > 1 (and you probably should), make sure your code is thread-safe.
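One way to make that pattern safe, using the same hypothetical names as above (scala.collection.concurrent.TrieMap here; a java.util.concurrent.ConcurrentHashMap or per-partition local state would do just as well):

    import scala.collection.concurrent.TrieMap

    object FeatureCache {
      // A lock-free concurrent map: safe for all task threads sharing the
      // executor JVM, so concurrent updates can no longer corrupt it.
      private val cache = TrieMap.empty[String, Double]

      def expensiveComputation(key: String): Double =
        key.length.toDouble // stand-in for the real work

      def lookup(key: String): Double =
        cache.getOrElseUpdate(key, expensiveComputation(key))
    }

Alternatively, building such state inside mapPartitions keeps it local to a single task and sidesteps cross-thread sharing entirely.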
