Do exit codes and exit statuses mean anything in Spark?


Question

I see exit codes and exit statuses all the time when running spark on yarn:

Here are a few:

  • CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

  • ...failed 2 times due to AM Container for application_1431523563856_0001_000002 exited with exitCode: 10...

  • ...Exit status: 143. Diagnostics: Container killed on request

  • ...Container exited with a non-zero exit code 52:...

  • ...Container killed on request. Exit code is 137...

I have never found any of these messages useful.... Is there any chance of interpreting what actually goes wrong from these? I have searched high and low for a table explaining the errors, but found nothing.

The ONLY one I am able to decipher from those above is exit code 52, but that's because I looked at the source code here. It says that one is an OOM.

Should I stop trying to interpret the rest of these exit codes and exit statuses? Or am I missing some obvious way that these numbers actually mean something?

Even if someone could tell me the difference between exit code, exit status, and SIGNAL, that would be useful. But I am just randomly guessing right now, and it seems everyone else around me who uses Spark is, too.

And, finally, why are some of the exit codes less than zero, and how should those be interpreted?

E.g. Exit status: -100. Diagnostics: Container released on a *lost* node

Solution

Neither exit codes and status nor signals are Spark specific but part of the way processes work on Unix-like systems.

Exit status and exit code

Exit status and exit codes are different names for the same thing. An exit status is a number between 0 and 255 which indicates the outcome of a process after it terminated. Exit status 0 usually indicates success. The meaning of the other codes is program dependent and should be described in the program's documentation. There are some established standard codes, though. See this answer for a comprehensive list.
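As a quick illustration (a minimal Python sketch, not Spark-specific): a child process reports its outcome through this 0-255 status, and on POSIX systems values passed to `exit()` outside that range wrap around modulo 256:

```python
import subprocess
import sys

# A child process reports its outcome through an exit status in 0..255.
ok = subprocess.run([sys.executable, "-c", "raise SystemExit(0)"])
err = subprocess.run([sys.executable, "-c", "raise SystemExit(52)"])

print(ok.returncode)   # 0  -> conventional success
print(err.returncode)  # 52 -> meaning is program-defined (OOM for Spark executors)

# Values outside 0..255 wrap around modulo 256 on POSIX: 300 % 256 == 44.
wrapped = subprocess.run([sys.executable, "-c", "raise SystemExit(300)"])
print(wrapped.returncode)  # 44
```

This wrap-around is one reason a negative `System.exit(-1)` in JVM code can surface as 255 in process listings, even though YARN's own diagnostics may still report the negative value it tracks internally.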

Exit codes used by Spark

In the Spark sources I found the following exit codes. Their descriptions are taken from log statements and comments in the code and from my understanding of the code where the exit status appeared.

Spark SQL CLI Driver in Hive Thrift Server

  • 3: if an UnsupportedEncodingException occurred when setting up stdout and stderr streams.

Spark/Yarn

  • 10: if an uncaught exception occurred
  • 11: if more than spark.yarn.scheduler.reporterThread.maxFailures executor failures occurred
  • 12: if the reporter thread failed with an exception
  • 13: if the program terminated before the user had initialized the spark context or if the spark context did not initialize before a timeout.
  • 14: This is declared as EXIT_SECURITY but never used
  • 15: if a user class threw an exception
  • 16: if the shutdown hook was called before the final status was reported. A comment in the source code explains the expected behaviour of user applications:

    The default state of ApplicationMaster is failed if it is invoked by the shutdown hook. This behavior is different from the 1.x versions. If the user application exits ahead of time by calling System.exit(N), this marks the application as failed with EXIT_EARLY. For a good shutdown, the user shouldn't call System.exit(0) to terminate the application.

Executors

  • 50: The default uncaught exception handler was reached
  • 51: The default uncaught exception handler was called and an exception was encountered while logging the exception
  • 52: The default uncaught exception handler was reached, and the uncaught exception was an OutOfMemoryError
  • 53: DiskStore failed to create local temporary directory after many attempts (bad spark.local.dir?)
  • 54: ExternalBlockStore failed to initialize after many attempts
  • 55: ExternalBlockStore failed to create a local temporary directory after many attempts
  • 56: Executor is unable to send heartbeats to the driver more than "spark.executor.heartbeat.maxFailures" times.

  • 101: Returned by spark-submit if the child main class was not found. In client mode (command line option --deploy-mode client) the child main class is the user submitted application class (--class CLASS). In cluster mode (--deploy-mode cluster) the child main class is the cluster manager specific submission/client class.
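The lists above can be condensed into a small lookup helper for decoding codes seen in YARN logs. This is only an illustrative sketch (the table and the `describe` function are my own, not any Spark API), covering the codes described in this answer:

```python
# Illustrative lookup table built from the exit-code lists above.
SPARK_EXIT_CODES = {
    10: "Spark/YARN: uncaught exception",
    11: "Spark/YARN: too many executor failures",
    12: "Spark/YARN: reporter thread failed with an exception",
    13: "Spark/YARN: SparkContext not initialized in time",
    15: "Spark/YARN: user class threw an exception",
    16: "Spark/YARN: shutdown hook ran before final status was reported",
    50: "Executor: default uncaught exception handler reached",
    51: "Executor: exception while logging an uncaught exception",
    52: "Executor: uncaught OutOfMemoryError",
    53: "Executor: DiskStore could not create a local temp dir",
    54: "Executor: ExternalBlockStore failed to initialize",
    55: "Executor: ExternalBlockStore could not create a local temp dir",
    56: "Executor: too many failed heartbeats to the driver",
    101: "spark-submit: child main class not found",
}

def describe(code: int) -> str:
    """Best-effort description of an exit code seen in Spark-on-YARN logs."""
    if code in SPARK_EXIT_CODES:
        return SPARK_EXIT_CODES[code]
    if code > 128:
        # Codes above 128 usually encode a terminating signal (see below).
        return f"likely killed by signal {code - 128}"
    return "unknown / program specific"

print(describe(52))   # Executor: uncaught OutOfMemoryError
print(describe(137))  # likely killed by signal 9
```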

Exit codes greater than 128

These exit codes most likely result from a program shutdown triggered by a Unix signal. The signal number can be calculated by subtracting 128 from the exit code. This is explained in more detail in this blog post (which was originally linked in this question). There is also a good answer explaining JVM-generated exit codes. Spark works under this assumption, as explained in a comment in ExecutorExitCodes.scala
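A minimal Python sketch of that subtraction rule, applied to the two codes from the question (the `signal_from_exit_code` helper is hypothetical, for illustration only):

```python
import signal

def signal_from_exit_code(exit_code: int) -> str:
    """Decode an exit code above 128 into the name of the killing signal."""
    assert exit_code > 128, "only exit codes above 128 encode a signal"
    return signal.Signals(exit_code - 128).name

# 137 = 128 + 9  -> SIGKILL (e.g. the kernel OOM killer, or `kill -9`)
print(signal_from_exit_code(137))  # SIGKILL
# 143 = 128 + 15 -> SIGTERM (e.g. YARN asking a container to shut down)
print(signal_from_exit_code(143))  # SIGTERM
```

This is why "Container killed on request. Exit code is 137" and "Exit status: 143. Diagnostics: Container killed on request" both describe a container terminated by a signal, just with different signals.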

Other exit codes

Apart from the exit codes listed above there are a number of System.exit() calls in the Spark sources setting 1 or -1 as the exit code. As far as I can tell, -1 seems to be used to indicate missing or incorrect command line parameters, while 1 indicates all other errors.

Signals

Signals are a kind of event that allows system messages to be sent to a process. These messages are used, for instance, to ask a process to reload its configuration (SIGHUP) or to terminate itself (SIGKILL). A list of standard signals can be found in the signal(7) man page, in the section Standard Signals.
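For instance, a POSIX process can catch SIGTERM and handle it instead of dying from it (SIGKILL and SIGSTOP cannot be caught). A minimal Python sketch, assuming a Unix-like OS:

```python
import os
import signal

# Record which signals the handler has seen.
caught = []

def on_sigterm(signum, frame):
    caught.append(signal.Signals(signum).name)

# Install the handler; the default action for SIGTERM would be to
# terminate the process (exit status 143 as seen by a shell: 128 + 15).
signal.signal(signal.SIGTERM, on_sigterm)

# Deliver SIGTERM to ourselves; the handler runs instead of the default action.
os.kill(os.getpid(), signal.SIGTERM)
print(caught)  # ['SIGTERM']
```

An uncaught SIGTERM is exactly what produces the exit status 143 seen in the YARN diagnostics quoted in the question.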

As explained by Rick Moritz in the comments below (thank you!), the most likely sources of signals in a Spark setup are

  • the cluster resource manager, when the container size was exceeded, the job finished, a dynamic scale-down occurred, or the job was aborted by the user
  • the operating system: as part of a controlled system shutdown, or when some resource limit was hit (out of memory, over a hard quota, no space left on disk, etc.)
  • a local user who killed a job

I hope this makes it a bit clearer what these messages from Spark might mean.
