Submit a Spark job from Airflow to an external Spark container

Problem description

I have a Spark and Airflow cluster built with Docker Swarm. The Airflow container does not include spark-submit as I expected.

I am using the following images, which are available on GitHub:

Spark: big-data-europe/docker-hadoop-spark-workbench

Airflow: puckel/docker-airflow (CeleryExecutor)

I prepared a .py file and added it under the dags folder.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

args = {'owner': 'airflow', 'start_date': datetime(2018, 9, 24)}

dag = DAG('spark_example_new', default_args=args, schedule_interval="@once")

# Submit /SimpleSpark.jar through the spark_default connection
operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',
    java_class='Main',
    application='/SimpleSpark.jar',
    name='airflow-spark-example',
    conf={'master': 'spark://master:7077'},
    dag=dag,
)

I also configured the connection in the Airflow web UI as follows:

Master is the hostname of the Spark master container.
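
For reference, the same spark_default connection could also be created programmatically instead of through the web UI. This is only a sketch: the values are inferred from the spark-submit command in the log below, and it assumes the snippet is run inside the Airflow container.

from airflow import settings
from airflow.models import Connection

# Create or update the spark_default connection used by SparkSubmitOperator
session = settings.Session()
conn = session.query(Connection).filter(Connection.conn_id == 'spark_default').first()
if conn is None:
    conn = Connection(conn_id='spark_default')
    session.add(conn)
conn.conn_type = 'spark'
# Keep the spark:// scheme in the host so the hook builds --master spark://master:7077
conn.host = 'spark://master'
conn.port = 7077
session.commit()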

However, it does not find spark-submit and produces the following error:

[2018-09-24 08:48:14,063] {{logging_mixin.py:95}} INFO - [2018-09-24 08:48:14,062] {{spark_submit_hook.py:283}} INFO - Spark-Submit cmd: ['spark-submit', '--master', 'spark://master:7077', '--conf', 'master=spark://master:7077', '--name', 'airflow-spark-example', '--class', 'Main', '--queue', 'root.default', '/SimpleSpark.jar']

[2018-09-24 08:48:14,067] {{models.py:1736}} ERROR - [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1633, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.6/site-packages/airflow/contrib/operators/spark_submit_operator.py", line 168, in execute
    self._hook.submit(self._application)
  File "/usr/local/lib/python3.6/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line 330, in submit
    **kwargs)
  File "/usr/local/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/usr/local/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'

Recommended answer

As far as I know, puckel/docker-airflow uses the Python slim image (https://hub.docker.com/_/python/). This image does not contain common packages and only includes the minimal packages needed to run Python. Hence, you will need to extend the image and install spark-submit in your container.
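
For example, a Dockerfile along these lines could extend the Puckel image with the Spark client. This is a rough, untested sketch: the Spark/Hadoop versions, the download URL and the final user switch are assumptions and should be adapted to your cluster.

FROM puckel/docker-airflow:latest

USER root

# openjdk installation on slim Debian images needs this directory to exist
RUN mkdir -p /usr/share/man/man1 && \
    apt-get update && \
    apt-get install -y --no-install-recommends openjdk-8-jre-headless wget && \
    rm -rf /var/lib/apt/lists/*

# Fetch a Spark distribution and put spark-submit on the PATH
ENV SPARK_VERSION=2.3.1 HADOOP_VERSION=2.7
RUN wget -qO- "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
      | tar -xz -C /opt && \
    ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} /opt/spark
ENV SPARK_HOME=/opt/spark
ENV PATH="${SPARK_HOME}/bin:${PATH}"

# The Puckel image normally runs as this user
USER airflow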

Edit: Airflow does need the Spark binaries in the container to run SparkSubmitOperator, as documented here.

The other approach you can use is the SSHOperator: SSH into a remote machine and run the spark-submit command on that external VM. But SSH has to be available there as well, and it is not included in the Puckel Airflow image.
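
A minimal sketch of that alternative is shown below. It assumes an Airflow image with the SSH dependencies (paramiko/sshtunnel) installed and an ssh_default connection pointing at a machine that has spark-submit; the connection id and the command are illustrative, not taken from the question.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

args = {'owner': 'airflow', 'start_date': datetime(2018, 9, 24)}
dag = DAG('spark_example_ssh', default_args=args, schedule_interval='@once')

# Run spark-submit on a remote host that has the Spark client installed
submit_via_ssh = SSHOperator(
    task_id='spark_submit_via_ssh',
    ssh_conn_id='ssh_default',
    command=(
        'spark-submit --master spark://master:7077 '
        '--class Main --name airflow-spark-example /SimpleSpark.jar'
    ),
    dag=dag,
)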
