Can't create spark session using yarn inside kubernetes pod


Problem description

I have a kubernetes pod with spark client installed.

bash-4.2# spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.2.0-205
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
Branch HEAD
Compiled by user jenkins on 2017-08-26T09:32:23Z
Revision a2efc34efde0fd268a9f83ea1861bd2548a8c188
Url git@github.com:hortonworks/spark2.git
Type --help for more information.
bash-4.2#

I can submit a spark job successfully under client and cluster mode using these commands:

${SPARK_HOME}/bin/spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=$PYTHONPATH:/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.4-src.zip --master yarn --deploy-mode client --num-executors 50 --executor-cores 4 --executor-memory 3G  --driver-memory 6G my_python_script.py --config=configurations/sandbox.yaml --startdate='2019-01-01' --enddate='2019-08-01'
${SPARK_HOME}/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 ${SPARK_HOME}/lib/spark-examples*.jar 10

But whenever I start a session using any of these:

spark-shell --master yarn
pyspark --master yarn

It hangs and times out with this error:

org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

We have another python script that needs to create a spark session. The code on that script is:

from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
conf.setAll(configs.items())  # configs: a dict of Spark settings defined elsewhere in the script
spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()

Not sure where else to check. This is the first time we are initiating a Spark connection from inside a Kubernetes cluster. Getting a Spark session inside a normal virtual machine works fine, so I'm not sure what the difference is in terms of network connectivity. It also puzzles me that I was able to submit the Spark jobs above but am unable to create a Spark session.

Any thoughts and ideas are highly appreciated. Thanks in advance.

Answer

In client mode the Spark Driver process runs on your machine and the Executors run on the Yarn nodes (spark-shell and pyspark submit client-mode sessions). For the Driver and Executor processes to communicate, they must be able to reach each other over the network in both directions.

Since submitting jobs in cluster mode works for you, and you can reach the Yarn master from the Kubernetes Pod network, that route is fine. Most probably you don't have network access from the Yarn cluster network back to the Pod, which most likely lives on a private Kubernetes network unless explicitly exposed. That is the first thing I would recommend you check, along with the Yarn logs.
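
One quick way to verify the reverse route is to listen on a port inside the Pod and then try to connect to it from one of the Yarn nodes. Below is a minimal sketch in Python; the port 29413 is a hypothetical placeholder (use whichever free port you plan to give the driver):

import socket

# Run this inside the Pod. It listens on the port you intend to use
# for spark.driver.port and waits for a single test connection.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("0.0.0.0", 29413))  # hypothetical port
srv.listen(1)
print("listening on 29413 - connect from a Yarn node to test the route")
conn, addr = srv.accept()
print("reachable from", addr)
conn.close()
srv.close()

If a connection attempt from a Yarn node, e.g. python -c "import socket; socket.create_connection(('<pod-address>', 29413), timeout=5)" where <pod-address> stands for however the Pod is exposed, times out, the Yarn network cannot reach the Pod and the driver will never hear back from the application master.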

After you expose the Pod so it is reachable from the Yarn cluster network, you may want to refer to the following Spark configs to set up the bindings (a short sketch follows the list):

- spark.driver.host
- spark.driver.port
- spark.driver.bindAddress
- spark.blockManager.port
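
Once the Pod is reachable, these settings can go straight into the session builder. The following is only a sketch; the hostname and port numbers are hypothetical placeholders and must match whatever address and ports you actually expose for the Pod (e.g. via a Kubernetes Service):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    # Address the Yarn nodes should use to reach the driver
    # (hypothetical; must be routable from the Yarn cluster network).
    .config("spark.driver.host", "my-exposed-pod.example.com")
    # Bind on all interfaces inside the Pod.
    .config("spark.driver.bindAddress", "0.0.0.0")
    # Fixed ports so they can be opened/exposed on the Pod (hypothetical values).
    .config("spark.driver.port", "29413")
    .config("spark.blockManager.port", "29414")
    .getOrCreate()
)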

The descriptions of these configs can be found in the Spark configuration documentation.
