Why am I not able to run sparkPi example on a Kubernetes (K8s) cluster?

Question

I have a K8s cluster up and running, on VMs inside VMWare Workstation, as of now. I'm trying to deploy a Spark application natively using the official documentation from here. However, I also landed on this article which made it clearer, I felt.

Now, earlier my setup was running inside nested VMs, basically my machine is on Win10 and I had an Ubuntu VM inside which I had 3 more VMs running for the cluster (not the best idea, I know).

When I tried to run my setup by following the article mentioned, I first created a service account inside the cluster called spark, then created a clusterrolebinding called spark-role, gave edit as the clusterrole and assigned it to the spark service account so that Spark driver pod has sufficient permissions.
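
Roughly, those two steps look like this (a sketch, assuming the default namespace, along the lines of the official Spark-on-Kubernetes docs):

# Create the service account the driver will run as
kubectl create serviceaccount spark

# Bind the 'edit' clusterrole to it so the driver can create and delete executor pods
kubectl create clusterrolebinding spark-role --clusterrole=edit \
  --serviceaccount=default:spark --namespace=default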

I then try to run the example SparkPi job using this command line:

bin/spark-submit \
  --master k8s://https://<k8-cluster-ip>:<k8-cluster-port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=kmaster:5000/spark:latest \
  --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 100

And it fails within a few seconds of creating the driver pod: it goes into Running state and after about 3 seconds goes into Error state.

On running the command kubectl logs spark-pi-driver, this is the log I get.

The second Caused by: is always one of the following:

  • Caused by: java.net.SocketException: Broken pipe (Write failed), or
  • Caused by: okhttp3.internal.http2.ConnectionShutdownException

Log #2, for reference.

After running into dead-ends with this, I tried giving --deploy-mode client to see if it makes a difference and get more verbose logs. You can read the difference between client and cluster mode from here.
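
For clarity, this is roughly how the submission changes in client mode (a sketch: in client mode the driver runs wherever spark-submit runs, so the executors must be able to reach back to it, e.g. via spark.driver.host below, which is a placeholder, and the driver-pod service-account setting no longer applies):

bin/spark-submit \
  --master k8s://https://<k8-cluster-ip>:<k8-cluster-port> \
  --deploy-mode client \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=kmaster:5000/spark:latest \
  --conf spark.driver.host=<ip-reachable-from-cluster-nodes> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 100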

On deploying the job in client mode it still fails; however, now I see that each time the driver (now running not as a pod but as a process on the local machine) tries to create an executor pod, it gets stuck in an infinite loop: it keeps creating executor pods with a count number appended to the pod name, as the previous one goes into a terminated state. Also, I can now see the Spark UI on port 4040, but the job doesn't move forward, as it's stuck trying to get even a single executor pod running.

This is the log I get.

To me, this makes it pretty apparent that it's a resource crunch maybe?
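
A quick way to confirm or rule that out, using standard kubectl diagnostics (nothing Spark-specific; the pod name is a placeholder):

# Inspect the last state / exit reason of a failed executor pod (look for OOMKilled, FailedScheduling, etc.)
kubectl describe pod <executor-pod-name>

# Check recent cluster events in chronological order
kubectl get events --sort-by=.metadata.creationTimestamp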

So, to be sure, I deleted the nested VMs, set up 2 new VMs on my main machine, connected them using a NAT network, and set up the same K8s cluster.

But now, when I try to do the exact same thing, it fails with the same error (Broken Pipe/ShutdownException), except now it tells me that it fails even at creating the driver pod.

Here is the log, for reference.

Now I can't even fetch logs as to why it fails, because it's never even created.

I've broken my head over this and can't figure out why it's failing. Now, I tried out a lot of things to rule them out but so far nothing has worked except one (which is a completely different solution).

I tried the spark-on-k8s-operator from GCP from here and it worked for me. I wasn't able to see the Spark UI as it runs briefly, but it prints the Pi value in the shell window, so I know it works. I'm guessing that even this spark-on-k8s-operator 'internally' does the same thing, but I really need to be able to deploy it natively, or at least know why it fails.

Any help here will be appreciated (I know it's a long post). Thank you.

Answer

Make sure the kubernetes version that you are deploying is compatible with the Spark version that you are using.

Apache Spark uses the Kubernetes Client library to communicate with the kubernetes cluster.

As of today, the latest LTS Spark version is 2.4.5, which includes Kubernetes Client version 4.6.3.

Check the compatibility matrix of the Kubernetes Client here:

The supported Kubernetes versions go all the way up to v1.17.0.
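
A quick way to check both sides of that matrix on a given setup (a sketch; it assumes $SPARK_HOME points at the Spark 2.4.x distribution, which ships the client library as a kubernetes-client jar):

# Kubernetes server version the cluster is running
kubectl version --short

# kubernetes-client jar bundled with the Spark distribution
ls $SPARK_HOME/jars | grep kubernetes-client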

Based on my personal experience, Apache Spark 2.4.5 works well with Kubernetes v1.15.3. I have had problems with more recent versions.
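
If the cluster is built with kubeadm, one way to pin the control plane to such a version (a sketch, assuming a kubeadm-based setup) is:

kubeadm init --kubernetes-version=v1.15.3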

When an unsupported Kubernetes version is used, the logs you get look like the ones you are describing:

Caused by: java.net.SocketException: Broken pipe (Write failed) or,
Caused by: okhttp3.internal.http2.ConnectionShutdownException
