Dependency issue with Pyspark running on Kubernetes using spark-on-k8s-operator

Problem description

I have spent days now trying to figure out a dependency issue I'm experiencing with (Py)Spark running on Kubernetes. I'm using the spark-on-k8s-operator and Spark's Google Cloud connector.

When I try to submit my Spark job without any dependencies using sparkctl create sparkjob.yaml ... with the .yaml file below, it works like a charm.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-job
  namespace: my-namespace
spec:
  type: Python
  pythonVersion: "3"
  hadoopConf:
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
    "fs.gs.project.id": "our-project-id"
    "fs.gs.system.bucket": "gcs-bucket-name"
    "google.cloud.auth.service.account.enable": "true"
    "google.cloud.auth.service.account.json.keyfile": "/mnt/secrets/keyfile.json"
  mode: cluster
  image: "image-registry/spark-base-image"
  imagePullPolicy: Always
  mainApplicationFile: ./sparkjob.py
  deps:
    jars:
      - https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.11/2.4.5/spark-sql-kafka-0-10_2.11-2.4.5.jar
  sparkVersion: "2.4.5"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 2.4.5
    serviceAccount: spark-operator-spark
    secrets:
    - name: "keyfile"
      path: "/mnt/secrets"
      secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: our-project-id
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.5
    secrets:
    - name: "keyfile"
      path: "/mnt/secrets"
      secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: our-project-id

The Docker image spark-base-image is built with the following Dockerfile:

FROM gcr.io/spark-operator/spark-py:v2.4.5

RUN rm $SPARK_HOME/jars/guava-14.0.1.jar
ADD https://repo1.maven.org/maven2/com/google/guava/guava/28.0-jre/guava-28.0-jre.jar $SPARK_HOME/jars

ADD https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-2.0.1/gcs-connector-hadoop2-2.0.1-shaded.jar $SPARK_HOME/jars

ENTRYPOINT [ "/opt/entrypoint.sh" ]

The main application file is uploaded to GCS when submitting the application and subsequently fetched from there and copied into the driver pod when the application starts. The problem starts whenever I want to supply my own Python module deps.zip as a dependency to be able to use it in my main application file sparkjob.py.
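For context, this is roughly what sparkjob.py is meant to look like once deps.zip is available. A minimal sketch, where the module mypackage.transforms, the function clean_records, and the gs:// paths are hypothetical placeholders:

from pyspark.sql import SparkSession

# Hypothetical module packaged inside deps.zip; the real package and function names will differ.
from mypackage.transforms import clean_records

spark = SparkSession.builder.appName("spark-job").getOrCreate()

# The gcs-connector configured via hadoopConf lets Spark read and write gs:// paths directly.
df = spark.read.json("gs://gcs-bucket-name/input/")
clean_records(df).write.mode("overwrite").parquet("gs://gcs-bucket-name/output/")

spark.stop()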

Here's what I have tried so far:

1. Added the following lines to spark.deps in sparkjob.yaml:

pyFiles:
   - ./deps.zip

which resulted in the operator not even being able to submit the Spark application, with the error

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

./deps.zip is successfully uploaded to the GCS bucket along with the main application file, but while the main application file can be fetched from GCS (I see this in the logs of jobs without dependencies, as defined above), ./deps.zip somehow cannot be fetched from there. I also tried adding the gcs-connector jar to the spark.deps.jars list explicitly - nothing changes.

2. I added ./deps.zip to the base Docker image used for starting up the driver and executor pods by adding COPY ./deps.zip /mnt/ to the above Dockerfile, and added the dependency in sparkjob.yaml via

pyFiles:
    - local:///mnt/deps.zip

This time the Spark job can be submitted and the driver pod is started; however, I get a file:/mnt/deps.zip not found error when the Spark context is being initialized. I also tried to additionally set ENV SPARK_EXTRA_CLASSPATH=/mnt/ in the Dockerfile, but without any success. I even tried to explicitly mount the whole /mnt/ directory into the driver and executor pods using volume mounts, but that didn't work either.

My workaround (2), adding the dependencies to the Docker image and setting ENV SPARK_EXTRA_CLASSPATH=/mnt/ in the Dockerfile, actually worked! Turns out the image tag didn't update and I had been using an old version of the Docker image all along. Duh.
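Since deps.zip is baked into the image at /mnt/deps.zip for this workaround, an alternative to spec.deps.pyFiles worth noting is registering the archive at runtime from the driver script. A minimal sketch, again with a hypothetical module name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# /mnt/deps.zip exists in both the driver and executor images, so it can be
# registered at runtime instead of being listed under spec.deps.pyFiles.
spark.sparkContext.addPyFile("/mnt/deps.zip")

# Import only after addPyFile has put the archive on the Python path.
from mypackage.transforms import clean_records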

I still don't know why the (more elegant) solution 1 via the gcs-connector isn't working, but it might be related to the MountVolume.Setup failed for volume "spark-conf-volume" issue.

Recommended answer

Use the Google Cloud Storage path to the Python dependencies, since they are uploaded there.

spec:
  deps:
    pyFiles:
      - gs://gcs-bucket-name/deps.zip
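With the gs:// path in spec.deps.pyFiles, the driver fetches deps.zip through the GCS connector configured in hadoopConf, and Spark puts the archive on the Python path of the driver and executors, so the main application file can import from it directly. A quick sanity check inside sparkjob.py, assuming the same hypothetical module name:

import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The distributed archive should show up on the driver's sys.path.
print([p for p in sys.path if p.endswith("deps.zip")])

from mypackage.transforms import clean_records  # packaged inside deps.zip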
