Apache Spark: "failed to launch org.apache.spark.deploy.worker.Worker" or Master
Problem Description
I have created a Spark cluster on OpenStack, running on Ubuntu 14.04 with 8 GB of RAM. I created two virtual machines with 3 GB each (keeping 2 GB for the parent OS). Further, I created a master and 2 workers on the first virtual machine and 3 workers on the second.
The spark-env.sh file has the basic settings:
export SPARK_MASTER_IP=10.0.0.30
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_CORES=1
Whenever I deploy the cluster with start-all.sh, I get "failed to launch org.apache.spark.deploy.worker.Worker" and sometimes "failed to launch org.apache.spark.deploy.master.Master". When I check the log file for the error, I see the following:
Spark Command: /usr/lib/jvm/java-7-openjdk-amd64/bin/java -cp /home/ubuntu/spark-1.5.1/sbin/../conf/:/home/ubuntu/spark-1.5.1/assembly/target/scala-2.10/spark-assembly-1.5.1-hadoop2.2.0.jar:/home/ubuntu/spark-1.5.1/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/home/ubuntu/spark-1.5.1/lib_managed/jars/datanucleus-core-3.2.10.jar:/home/ubuntu/spark-1.5.1/lib_managed/jars/datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip 10.0.0.30 --port 7077 --webui-port 8080
Though I get the failure message, the master or worker still becomes alive after a few seconds.
Can somebody explain the reason?
Answer
The Spark configuration system is a mess of environment variables, argument flags, and Java Properties files. I just spent a couple hours tracking down the same warning, and unraveling the Spark initialization procedure, and here's what I found:
1. sbin/start-all.sh calls sbin/start-master.sh (and then sbin/start-slaves.sh)
2. sbin/start-master.sh calls sbin/spark-daemon.sh start org.apache.spark.deploy.master.Master ...
3. sbin/spark-daemon.sh start ... forks off a call to bin/spark-class org.apache.spark.deploy.master.Master ..., captures the resulting process id (pid), sleeps for 2 seconds, and then checks whether that pid's command's name is "java"
4. bin/spark-class is a bash script, so it starts out with the command name "bash", and proceeds to:
4.1. (re-)load the Spark environment by sourcing bin/load-spark-env.sh
4.2. find the java executable
4.3. find the right Spark jar
4.4. call java ... org.apache.spark.launcher.Main ... to get the full classpath needed for a Spark deployment
4.5. then finally hand over control, via exec, to java ... org.apache.spark.deploy.master.Master, at which point the command name becomes "java"
If steps 4.1 through 4.5 take longer than 2 seconds, which in my (and your) experience seems pretty much inevitable on a fresh OS where java has never been previously run, you'll get the "failed to launch" message, despite nothing actually having failed.
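To make the race concrete, here's a simplified sketch of the launch-and-check sequence inside sbin/spark-daemon.sh; it's paraphrased from memory, not the literal script, and the variable names are illustrative:

nohup "$SPARK_HOME/bin/spark-class" "$command" "$@" >> "$log" 2>&1 < /dev/null &
newpid=$!                      # pid of the bash process running spark-class
sleep 2                        # grace period for spark-class to exec into java
# ps -o comm= prints only the command name for the given pid
if [[ "$(ps -p "$newpid" -o comm=)" != "java" ]]; then
  echo "failed to launch $command"   # fires even when the launch eventually succeeds
fi

If the exec happens at 2.1 seconds instead of 1.9, the only difference is this spurious message.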
The slaves will complain for the same reason, and thrash around until the master is actually available, but they should keep retrying until they successfully connect to the master.
I've got a pretty standard Spark deployment running on EC2; I use:
- conf/spark-defaults.conf to set spark.executor.memory and add some custom jars via spark.{driver,executor}.extraClassPath
- conf/spark-env.sh to set SPARK_WORKER_CORES=$(($(nproc) * 2))
- conf/slaves to list my slaves
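For concreteness, here's roughly what those three files contain; the memory size, jar path, and hostnames below are placeholders, not my actual values:

# conf/spark-defaults.conf
spark.executor.memory          4g
spark.driver.extraClassPath    /opt/jars/custom.jar
spark.executor.extraClassPath  /opt/jars/custom.jar

# conf/spark-env.sh
export SPARK_WORKER_CORES=$(($(nproc) * 2))

# conf/slaves -- one worker hostname per line
worker1.example.com
worker2.example.com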
Here's how I start a Spark deployment, bypassing some of the {bin,sbin}/*.sh minefield/maze:
# on master, with SPARK_HOME and conf/slaves set appropriately
# ask the launcher to print the full java command for the Master class,
# then split its NUL-separated output into a bash array
mapfile -t ARGS < <(java -cp "$SPARK_HOME/lib/spark-assembly-1.6.1-hadoop2.6.0.jar" org.apache.spark.launcher.Main org.apache.spark.deploy.master.Master | tr '\0' '\n')
# ARGS now contains the full call to start the master, which I daemonize with nohup
SPARK_PUBLIC_DNS=0.0.0.0 nohup "${ARGS[@]}" >> "$SPARK_HOME/master.log" 2>&1 < /dev/null &
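Once that's running, a couple of quick sanity checks (assuming the default web UI port of 8080):

tail -n 20 "$SPARK_HOME/master.log"                      # master should log that it's up and ALIVE
curl -s -o /dev/null -w '%{http_code}\n' localhost:8080  # 200 means the web UI is serving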
I'm still using sbin/spark-daemon.sh to start the slaves, since that's easier than calling nohup within the ssh command:
MASTER=spark://$(hostname -i):7077
while read -r; do
  # $REPLY holds the slave hostname read from conf/slaves;
  # the "1" is the worker instance number expected by spark-daemon.sh
  ssh -o StrictHostKeyChecking=no "$REPLY" "$SPARK_HOME/sbin/spark-daemon.sh start org.apache.spark.deploy.worker.Worker 1 $MASTER" &
done < "$SPARK_HOME/conf/slaves"
# this forks the ssh calls, so wait for them to exit before you logout
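The simplest way to honor that last comment is a bare wait, which blocks until every backgrounded ssh call has exited:

wait   # returns once all forked ssh calls above have finished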
There! It assumes that I'm using all the default ports and stuff, and that I'm not doing stupid shit like putting whitespace in filenames, but I think it's cleaner this way.