Apache Spark: Job aborted due to stage failure: "TID x failed for unknown reasons"


Question

I'm dealing with some strange error messages that I think come down to a memory issue, but I'm having a hard time pinning it down and could use some guidance from the experts.

I have a 2-machine Spark (1.0.1) cluster. Both machines have 8 cores; one has 16GB of memory, the other 32GB (the latter is the master). My application involves computing pairwise pixel affinities in images, though the images I've tested so far only get as big as 1920x1200, and as small as 16x16.
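For context, the traceback below suggests a pattern roughly like the following: broadcast the image, compute an affinity for every pair of pixel indices, then collect() the results to the driver. This is a hedged sketch based only on that traceback, not the actual affinity.py, but it shows why the data volume grows quadratically with the pixel count (a 1920x1200 image has ~2.3 million pixels, i.e. on the order of 5x10^12 pairs):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="affinity-sketch")

image = np.random.rand(16 * 16)                  # flattened pixel values (placeholder data)
IMAGE = sc.broadcast(image)                      # ship the image to every executor once

indices = sc.parallelize(range(image.size), 32)  # one element per pixel index
pairs = indices.cartesian(indices)               # O(n^2) index pairs -- the memory hazard

affinities = pairs.map(lambda x: np.abs(IMAGE.value[x[0]] - IMAGE.value[x[1]]))
result = affinities.collect()                    # collect() pulls every affinity back to the driver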

I did have to change a few memory and parallelism settings, otherwise I was getting explicit OutOfMemoryExceptions. In spark-default.conf:

spark.executor.memory    14g
spark.default.parallelism    32
spark.akka.frameSize        1000

In spark-env.sh:

SPARK_DRIVER_MEMORY=10G

With those settings, however, I get a bunch of WARN statements about "Lost TIDs" (no task completes successfully), in addition to lost executors; these repeat 4 times until I finally get the following error message and crash:

14/07/18 12:06:20 INFO TaskSchedulerImpl: Cancelling stage 0
14/07/18 12:06:20 INFO DAGScheduler: Failed to run collect at /home/user/Programming/PySpark-Affinities/affinity.py:243
Traceback (most recent call last):
  File "/home/user/Programming/PySpark-Affinities/affinity.py", line 243, in <module>
    lambda x: np.abs(IMAGE.value[x[0]] - IMAGE.value[x[1]])
  File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/pyspark/rdd.py", line 583, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:13 failed 4 times, most recent failure: TID 32 on host master.host.univ.edu failed for unknown reason
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

14/07/18 12:06:20 INFO DAGScheduler: Executor lost: 4 (epoch 4)
14/07/18 12:06:20 INFO BlockManagerMasterActor: Trying to remove executor 4 from BlockManagerMaster.
14/07/18 12:06:20 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
user@master:~/Programming/PySpark-Affinities$

If I run a really small image instead (16x16), it appears to run to completion (it gives me the output I expect without any exceptions being thrown). However, the stderr logs for that application list its state as "KILLED", with the final message being "ERROR CoarseGrainedExecutorBackend: Driver Disassociated". If I run any larger image, I get the exception pasted above.

Furthermore, if I just do a spark-submit with master=local[*], then aside from still needing to set the aforementioned memory options, it works for an image of any size (I've tested both machines independently; they both behave this way when running as local[*]).
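For reference, the local run described here would presumably be launched with something like the command below; the script path is taken from the traceback above and --driver-memory mirrors SPARK_DRIVER_MEMORY, so adjust both to your setup:

spark-submit --master "local[*]" --driver-memory 10g /home/user/Programming/PySpark-Affinities/affinity.py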

Any idea what's going on?

Answer

If I had a penny for every time I've asked people "have you tried increasing the number of partitions to something quite large, like at least 4 tasks per CPU, or even as high as 1000 partitions?", I'd be a rich man. So: have you tried increasing the partitions?
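In PySpark that advice boils down to something like the sketch below (placeholder data and names, not the asker's code); the partition count can be set when an RDD is created, or raised afterwards with repartition():

from pyspark import SparkContext

sc = SparkContext(appName="partition-sketch")

data = sc.parallelize(range(1000000))        # defaults to spark.default.parallelism slices
data = data.repartition(1000)                # spread the work over ~1000 tasks (4+ per core)
print(data.map(lambda i: i * i).count())     # stand-in for the real, expensive computation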

Anyway, other things I've found to help with weird disassociations are the following (one way to express them as config entries is sketched after this list):

  • a frame size of 500
  • an ask timeout of 100
  • a worker timeout of 150 (to handle massive GC hangs)
  • fiddling with the memory caches (see spark java.lang.OutOfMemoryError: Java heap space, http://stackoverflow.com/questions/21138751/spark-java-lang-outofmemoryerror-java-heap-space/22742982#22742982)
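These map roughly onto spark-defaults.conf entries like the ones below. The property names are my best guess at the Spark 1.x settings being referred to (spark.storage.memoryFraction stands in for "fiddling with the memory caches", and its 0.3 value is only an illustration; spark.worker.timeout is read by the standalone master rather than by the application), so check them against your version's configuration docs:

spark.akka.frameSize    500
spark.akka.askTimeout    100
spark.worker.timeout    150
spark.storage.memoryFraction    0.3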

Also, sometimes you get more informative stack traces by using the UI to navigate to a specific worker's stderr logs.

UPDATE: Since Spark 1.0.0, finding the Spark logs cannot be done via the UI; you have to ask your sysadmin/devops to help you, since the location of the logs is completely undocumented.
