Apache Spark and Apache Airflow connection in a Docker-based solution


Problem description

I have a Spark cluster and an Airflow cluster, and I want to send a Spark job from the Airflow container to the Spark container. But I am new to Airflow and I don't know which configuration I need to perform. I copied spark_submit_operator.py into the plugins folder.

from airflow import DAG

from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from datetime import datetime, timedelta

args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 7, 31)
}
dag = DAG('spark_example_new', default_args=args, schedule_interval="*/10 * * * *")

operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',
    java_class='Simple',
    application='/spark/abc.jar',
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='2g',
    num_executors='1',
    name='airflow-spark-example',
    verbose=False,
    driver_memory='1g',
    application_args=["1000"],
    conf={'master': 'spark://master:7077'},
    dag=dag,
)

master is the hostname of our Spark Master container. When I run the DAG, it produces the following error:

[2018-09-20 05:57:46,637] {{models.py:1569}} INFO - Executing <Task(SparkSubmitOperator): spark_submit_job> on 2018-09-20T05:57:36.756154+00:00
[2018-09-20 05:57:46,637] {{base_task_runner.py:124}} INFO - Running: ['bash', '-c', 'airflow run spark_example_new spark_submit_job 2018-09-20T05:57:36.756154+00:00 --job_id 4 --raw -sd DAGS_FOLDER/firstJob.py --cfg_path /tmp/tmpn2hznb5_']
[2018-09-20 05:57:47,002] {{base_task_runner.py:107}} INFO - Job 4: Subtask spark_submit_job [2018-09-20 05:57:47,001] {{settings.py:174}} INFO - setting.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800
[2018-09-20 05:57:47,312] {{base_task_runner.py:107}} INFO - Job 4: Subtask spark_submit_job [2018-09-20 05:57:47,311] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2018-09-20 05:57:47,428] {{base_task_runner.py:107}} INFO - Job 4: Subtask spark_submit_job [2018-09-20 05:57:47,428] {{models.py:258}} INFO - Filling up the DagBag from /usr/local/airflow/dags/firstJob.py
[2018-09-20 05:57:47,447] {{base_task_runner.py:107}} INFO - Job 4: Subtask spark_submit_job [2018-09-20 05:57:47,447] {{cli.py:492}} INFO - Running <TaskInstance: spark_example_new.spark_submit_job 2018-09-20T05:57:36.756154+00:00 [running]> on host e6dd59dc595f
[2018-09-20 05:57:47,471] {{logging_mixin.py:95}} INFO - [2018-09-20 05:57:47,470] {{spark_submit_hook.py:283}} INFO - Spark-Submit cmd: ['spark-submit', '--master', 'yarn', '--conf', 'master=spark://master:7077', '--num-executors', '1', '--total-executor-cores', '1', '--executor-cores', '1', '--executor-memory', '2g', '--driver-memory', '1g', '--name', 'airflow-spark-example', '--class', 'Simple', '/spark/ugur.jar', '1000']

[2018-09-20 05:57:47,473] {{models.py:1736}} ERROR - [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1633, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.6/site-packages/airflow/contrib/operators/spark_submit_operator.py", line 168, in execute
    self._hook.submit(self._application)
  File "/usr/local/lib/python3.6/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line 330, in submit
    **kwargs)
  File "/usr/local/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/usr/local/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'

It is running this command:

Spark-Submit cmd: ['spark-submit', '--master', 'yarn', '--conf', 'master=spark://master:7077', '--num-executors', '1', '--total-executor-cores', '1', '--executor-cores', '1', '--executor-memory', '2g', '--driver-memory', '1g', '--name', 'airflow-spark-example', '--class', 'Simple', '/spark/ugur.jar', '1000']

But I am not using YARN.
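The traceback above also shows a problem that is independent of the master value: subprocess cannot find a spark-submit executable on the PATH of the Airflow worker container, so the Spark client binaries have to be installed (or mounted) in that image before the operator can run at all. A minimal sketch to confirm this from inside the worker container, assuming Python 3 is available there:

# Hedged check: is the spark-submit binary that SparkSubmitHook shells out to
# visible on this container's PATH?
import shutil

spark_submit_path = shutil.which("spark-submit")
if spark_submit_path is None:
    print("spark-submit is not on PATH; install the Spark client in the Airflow "
          "image (or mount it) and extend PATH / SPARK_HOME accordingly")
else:
    print("spark-submit found at", spark_submit_path)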

Recommended answer

If you use SparkSubmitOperator, the connection to the master is set to "yarn" by default, regardless of the master you set in your Python code. However, you can override the master by specifying a conn_id through the operator's constructor, provided you have already created that conn_id under the "Admin -> Connections" menu in the Airflow web interface. I hope this helps.
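As a concrete illustration of that answer, here is a minimal sketch (Airflow 1.10-era API; the field values are assumptions based on the hostname mentioned in the question, not something given in the original answer) that registers the Spark connection programmatically instead of clicking through the Admin -> Connections screen, so that the hook resolves the master to spark://master:7077 instead of falling back to yarn:

# Minimal sketch: create the "spark_default" connection used by the DAG above,
# if it does not exist yet. Assumption: SparkSubmitHook builds the --master URL
# from the connection's host and port, so host carries the spark:// scheme.
from airflow import settings
from airflow.models import Connection

session = settings.Session()
existing = (
    session.query(Connection)
    .filter(Connection.conn_id == "spark_default")
    .first()
)
if existing is None:
    session.add(
        Connection(
            conn_id="spark_default",
            conn_type="spark",
            host="spark://master",
            port=7077,
        )
    )
    session.commit()
session.close()

With such a connection in place, the conf={'master': ...} entry in the operator becomes redundant, since --master is taken from the connection; and, as the traceback shows, the spark-submit binary still has to be present inside the Airflow worker image.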
