Spark driver program launching in `cluster` mode failed in a weird way


Problem Description


I'm new to Spark. I've run into a problem: when I launch a program on a standalone Spark cluster with the following command line:

./spark-submit --class scratch.Pi --deploy-mode cluster --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 hdfs://bx-42-68:9000/jars/pi.jar
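(For context, the jar was staged on HDFS beforehand so that whichever worker the master picks to run the driver can fetch it. The exact staging commands are not part of this post, but they would look roughly like this, using the local jar path from the client-mode command further down:)

# Assumed staging step: copy the application jar to HDFS so any worker
# chosen to run the driver can download it (paths taken from this post).
hdfs dfs -mkdir -p hdfs://bx-42-68:9000/jars
hdfs dfs -put /data11/pi.jar hdfs://bx-42-68:9000/jars/pi.jar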

It throws the following error:

15/01/28 19:48:51 INFO Slf4jLogger: Slf4jLogger started
15/01/28 19:48:51 INFO Utils: Successfully started service 'driverClient' on port 59290.
Sending launch command to spark://bx-42-68:7077
Driver successfully submitted as driver-20150128194852-0003
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150128194852-0003 is FAILED

The cluster master outputs the following log:

15/01/28 19:48:52 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
15/01/28 19:48:52 INFO Master: Launching driver driver-20150128194852-0003 on worker worker-20150126133948-bx-42-151-26286
15/01/28 19:48:55 INFO Master: Removing driver: driver-20150128194852-0003
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient@bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient@bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://driverClient@bx-42-68:59290] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/01/28 19:48:57 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.16.42.68%3A48091-16#-1393479428] was not delivered. [9] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 

And the worker that launched the driver program outputs:

15/01/28 19:48:52 INFO Worker: Asked to launch driver driver-20150128194852-0003
15/01/28 19:48:52 INFO DriverRunner: Copying user jar hdfs://bx-42-68:9000/jars/pi.jar to /data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/01/28 19:48:55 INFO DriverRunner: Launch Command: "/opt/apps/jdk-1.7.0_60/bin/java" "-cp" "/data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar:::/data11/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/data11/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-XX:MaxPermSize=128m" "-Dspark.executor.memory=5g" "-Dspark.akka.askTimeout=10" "-Dspark.rdd.compress=true" "-Dspark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" "-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" "-Dspark.app.name=YANL" "-Dspark.driver.extraJavaOptions=-XX:MaxPermSize=1024m" "-Dspark.jars=hdfs://bx-42-68:9000/jars/pi.jar" "-Dspark.master=spark://bx-42-68:7077" "-Dspark.storage.memoryFraction=0.6" "-Dakka.loglevel=WARNING" "-XX:MaxPermSize=1024m" "-Xms5120M" "-Xmx5120M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker@bx-42-151:26286/user/Worker" "scratch.Pi"
15/01/28 19:48:55 WARN Worker: Driver driver-20150128194852-0003 exited with failure
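(The worker only reports that the driver exited with failure; for completeness, the driver's own stdout/stderr live under the work directory shown in the DriverRunner log above. A rough way to inspect them on that worker, assuming Spark's usual per-driver file layout:)

# Run on the worker that launched the driver (bx-42-151).
# The directory comes from the DriverRunner log above; the stdout/stderr
# file names are an assumption based on Spark's usual per-driver layout.
cd /data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003
ls -l               # should contain the fetched pi.jar plus stdout/stderr
tail -n 100 stderr  # any exception from the driver JVM usually ends up here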

My spark-env.sh is:

export SCALA_HOME=/opt/apps/scala-2.11.5
export JAVA_HOME=/opt/apps/jdk-1.7.0_60
export SPARK_HOME=/data11/spark-1.2.0-bin-hadoop2.4
export PATH=$JAVA_HOME/bin:$PATH
export SPARK_MASTER_IP=`hostname -f`
export SPARK_LOCAL_IP=`hostname -f`
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=10.16.42.68:2181,10.16.42.134:2181,10.16.42.151:2181,10.16.42.150:2181,10.16.42.125:2181 -Dspark.deploy.zookeeper.dir=/spark"
SPARK_WORKER_MEMORY=43g
SPARK_WORKER_CORES=22

And my spark-defaults.conf is:

spark.executor.extraJavaOptions  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.executor.memory            20g
spark.rdd.compress               true
spark.storage.memoryFraction     0.6
spark.serializer                 org.apache.spark.serializer.KryoSerializer

However, when I launch the program in client mode with the following command, it works fine.

./spark-submit --class scratch.Pi --deploy-mode client --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 /data11/pi.jar

Solution

The reason it works in "client" mode and not in "cluster" mode is that there is no support for "cluster" mode in a standalone cluster (as mentioned in the Spark documentation).

Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. 

Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or python applications.

If you look at the "Submitting Applications" section in the Spark documentation, it clearly states that cluster mode support is not available for standalone clusters.
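Concretely, here is a trimmed-down sketch of the two invocations from the question (memory, name and JVM flags omitted for brevity), annotated with where the driver ends up running in each case:

# Cluster mode: spark-submit asks the standalone master to launch the driver
# on one of the workers, which then has to fetch the jar from HDFS.
# This is the submission that fails on this setup.
./spark-submit --class scratch.Pi --deploy-mode cluster \
  --master spark://bx-42-68:7077 \
  hdfs://bx-42-68:9000/jars/pi.jar

# Client mode: the driver runs inside this spark-submit process on the
# submitting machine, so a local jar path is enough. This is the one that works.
./spark-submit --class scratch.Pi --deploy-mode client \
  --master spark://bx-42-68:7077 \
  /data11/pi.jar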

Reference link: http://spark.apache.org/docs/1.2.0/submitting-applications.html

Go to the above link and have a look at the "Launching Applications with spark-submit" section.

I think it will help. Thanks.
