Do exit codes and exit statuses mean anything in Spark?


Question

I see exit codes and exit statuses all the time when running Spark on YARN:

Here are a few:

  • CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
  • ...failed 2 times due to AM Container for application_1431523563856_0001_000002 exited with exitCode: 10...
  • ...Exit status: 143. Diagnostics: Container killed on request
  • ...Container exited with a non-zero exit code 52:...
  • ...Container killed on request. Exit code is 137...

I have never found any of these messages useful. Is there any chance of interpreting what actually goes wrong with these? I have searched high and low for a table explaining the errors, but found nothing.

The ONLY one of those above I am able to decipher is exit code 52, but that's because I looked at the source code here. It says that it is an OOM.

Should I stop trying to interpret the rest of these exit codes and exit statuses? Or am I missing some obvious way that these numbers actually mean something?

Even if someone could tell me the difference between exit code, exit status, and SIGNAL, that would be useful. But I am just randomly guessing right now, and it seems everyone else around me who uses Spark is, too.

And, finally, why are some of the exit codes less than zero, and how should those be interpreted?

For example: Exit status: -100. Diagnostics: Container released on a *lost* node

Answer

Neither exit codes and statuses nor signals are Spark specific; they are part of the way processes work on Unix-like systems.

Exit status and exit code are different names for the same thing. An exit status is a number between 0 and 255 which indicates the outcome of a process after it terminates. Exit status 0 usually indicates success. The meaning of the other codes is program dependent and should be described in the program's documentation. There are some established standard codes, though; see this answer for a comprehensive list.
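As a small illustration, here is a sketch of how a parent process reads a child's exit status from Scala. It assumes a Unix-like system where the standard false utility (which always exits with status 1) is on the PATH:

```scala
import scala.sys.process._

object ExitStatusDemo extends App {
  // Run a child process and read its exit status (0-255 on Unix-like systems).
  val status = Process("false").!
  println(s"child exited with status $status") // prints: child exited with status 1

  // A JVM process sets its own exit status via sys.exit / System.exit.
  sys.exit(0) // 0 conventionally signals success to whoever launched us
}
```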

In the Spark sources I found the following exit codes. Their descriptions are taken from log statements and comments in the code, and from my understanding of the code where the exit status appears.

  • 3: if an UnsupportedEncodingException occurred when setting up the stdout and stderr streams.
  • 10: if an uncaught exception occurred
  • 11: if more than spark.yarn.scheduler.reporterThread.maxFailures executor failures occurred
  • 12: if the reporter thread failed with an exception
  • 13: if the program terminated before the user had initialized the Spark context, or if the Spark context did not initialize before a timeout
  • 14: declared as EXIT_SECURITY but never used
  • 15: if a user class threw an exception
  • 16: if the shutdown hook was called before the final status was reported. A comment in the source code explains the expected behaviour of user applications:

The default state of ApplicationMaster is failed if it is invoked by shut down hook. This behavior is different compared to 1.x version. If user application is exited ahead of time by calling System.exit(N), here mark this application as failed with EXIT_EARLY. For a good shutdown, user shouldn't call System.exit(0) to terminate the application.
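In practice that means letting the driver's main method return after stopping the Spark context rather than calling System.exit(0). A minimal sketch, assuming the Spark 2.x SparkSession API (the application name and logic are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object GoodShutdown {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("good-shutdown").getOrCreate()
    try {
      // ... application logic ...
      println(spark.range(10).count())
    } finally {
      // Stop the context and let main() return normally. Do NOT call
      // System.exit(0) here, or the ApplicationMaster may mark the
      // application as failed with EXIT_EARLY.
      spark.stop()
    }
  }
}
```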

  • 50: The default uncaught exception handler was reached
  • 51: The default uncaught exception handler was called, and an exception was encountered while logging the exception
  • 52: The default uncaught exception handler was reached, and the uncaught exception was an OutOfMemoryError
  • 53: DiskStore failed to create a local temporary directory after many attempts (bad spark.local.dir?)
  • 54: ExternalBlockStore failed to initialize after many attempts
  • 55: ExternalBlockStore failed to create a local temporary directory after many attempts
  • 56: Executor was unable to send heartbeats to the driver more than spark.executor.heartbeat.maxFailures times
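Codes 50 through 52 all come from the default uncaught exception handler. Here is a simplified sketch, loosely modeled on Spark's SparkUncaughtExceptionHandler and SparkExitCode (not the actual implementation), showing how the three codes arise:

```scala
object SparkExitCodeSketch {
  val UNCAUGHT_EXCEPTION = 50       // the handler was reached
  val UNCAUGHT_EXCEPTION_TWICE = 51 // the handler itself failed while logging
  val OOM = 52                      // the uncaught exception was an OutOfMemoryError
}

class UncaughtHandlerSketch extends Thread.UncaughtExceptionHandler {
  override def uncaughtException(thread: Thread, exception: Throwable): Unit = {
    try {
      System.err.println(s"Uncaught exception in $thread: $exception")
      exception match {
        case _: OutOfMemoryError => System.exit(SparkExitCodeSketch.OOM)
        case _                   => System.exit(SparkExitCodeSketch.UNCAUGHT_EXCEPTION)
      }
    } catch {
      // Logging the exception threw as well: exit with 51.
      case _: Throwable => System.exit(SparkExitCodeSketch.UNCAUGHT_EXCEPTION_TWICE)
    }
  }
}
```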

101: Returned by spark-submit if the child main class was not found. In client mode (command line option --deploy-mode client) the child main class is the user-submitted application class (--class CLASS). In cluster mode (--deploy-mode cluster) the child main class is the cluster-manager-specific submission/client class.

Exit codes greater than 128 (such as 137 and 143 above) most likely result from a program shutdown triggered by a Unix signal. The signal number can be calculated by subtracting 128 from the exit code. This is explained in more detail in this blog post (which was originally linked in this question). There is also a good answer explaining JVM-generated exit codes. Spark works with this assumption, as explained in a comment in ExecutorExitCode.scala.
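As a quick illustration of that convention (a hypothetical helper, not something in Spark):

```scala
object SignalFromExitStatus {
  // 137 = 128 + 9  -> SIGKILL (e.g. YARN or the kernel OOM killer using kill -9)
  // 143 = 128 + 15 -> SIGTERM (a polite termination request)
  def describe(status: Int): String =
    if (status > 128) s"likely killed by signal ${status - 128}"
    else s"exited on its own with code $status"

  def main(args: Array[String]): Unit = {
    println(describe(137)) // likely killed by signal 9
    println(describe(143)) // likely killed by signal 15
  }
}
```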

Apart from the exit codes listed above, there are a number of System.exit() calls in the Spark sources that set 1 or -1 as the exit code. As far as I can tell, -1 seems to be used to indicate missing or incorrect command line parameters, while 1 indicates all other errors. Note that on Unix-like systems a negative argument to System.exit() is truncated to the 0-255 range (so -1 shows up as 255), and a status such as the -100 in the question is not a process exit status at all: it appears to be a YARN-internal constant from ContainerExitStatus (-100 is ABORTED, used for containers that were released or lost before they could report a status).

Signals are a kind of event that allows system messages to be sent to a process. These messages are used, for instance, to ask a process to reload its configuration (SIGHUP) or to terminate itself (SIGKILL). A list of standard signals can be found in the signal(7) man page, in the section Standard Signals.
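For example, the JVM translates SIGTERM into a normal shutdown that runs registered shutdown hooks (and then exits with status 128 + 15 = 143), whereas SIGKILL cannot be caught at all. A small sketch:

```scala
object SignalDemo extends App {
  // Runs on normal exit or on SIGTERM/SIGINT; never runs on SIGKILL.
  sys.addShutdownHook {
    println("shutting down, cleaning up...")
  }
  println("sleeping for a minute; try `kill <pid>` from another shell")
  Thread.sleep(60000)
}
```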

As explained by Rick Moritz in the comments below (thank you!), the most likely sources of signals in a Spark setup are:

  • the cluster resource manager: when the container size was exceeded, the job finished, a dynamic scale-down was made, or the job was aborted by the user
  • the operating system: as part of a controlled system shutdown, or if some resource limit was hit (out of memory, over a hard quota, no space left on disk, etc.)
  • a local user who killed the job

I hope this makes it a bit clearer what these messages from Spark might mean.

