Scheduling Spark Jobs Running on Kubernetes via Airflow


Problem Description


I have a Spark job that runs via a Kubernetes pod. Until now I have been using a YAML file to run my jobs manually. Now I want to schedule my Spark jobs via Airflow. This is the first time I am using Airflow, and I cannot figure out how to add my YAML file to Airflow. From what I have read, I can schedule my jobs via a DAG in Airflow. An example DAG looks like this:

from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import datetime, timedelta

# Default arguments shared by all tasks in the DAG
args = {'owner': 'test', 'start_date': datetime(2019, 4, 3), 'retries': 2, 'retry_delay': timedelta(minutes=1)}
dag = DAG('test_dag', default_args=args, catchup=False)

# Two trivial Python callables used as placeholder tasks
def print_text1():
    print("hello-world1")

def print_text():
    print('Hello-World2')

t1 = PythonOperator(task_id='multitask1', python_callable=print_text1, dag=dag)
t2 = PythonOperator(task_id='multitask2', python_callable=print_text, dag=dag)
t1 >> t2


In this case the above tasks will get executed one after the other once I trigger the DAG. Now, if I want to run a spark-submit job instead, what should I do? I am using Spark 2.4.4.

Recommended Answer


Airflow has a concept of operators, which represent Airflow tasks. Your example uses PythonOperator, which simply executes Python code and is most probably not the one you are interested in, unless you submit the Spark job from within that Python code. There are several operators you can make use of:

  • BashOperator, which executes a given bash command for you. You can use it to run kubectl or spark-submit directly (a sketch of this option follows the list).
  • SparkSubmitOperator, a dedicated operator for calling spark-submit.
  • KubernetesPodOperator, which creates a Kubernetes pod for you; you can use it to launch your Driver pod directly (also sketched below).
  • Hybrid solutions, e.g. HttpOperator + Livy on Kubernetes: you spin up a Livy server on Kubernetes, which acts as a Spark Job Server and exposes a REST API that the Airflow HttpOperator can call.
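
For the BashOperator route, here is a minimal sketch assuming Airflow 1.10.x, Spark 2.4.4 binaries available on the Airflow workers, and placeholder values; the API server address, container image and application jar below are assumptions you would replace with your own:

from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

args = {'owner': 'test', 'start_date': datetime(2019, 4, 3), 'retries': 2, 'retry_delay': timedelta(minutes=1)}
dag = DAG('spark_k8s_bash', default_args=args, catchup=False)

# spark-submit against the Kubernetes master (Spark 2.4 style); all <...> values are placeholders.
spark_submit_cmd = (
    'spark-submit '
    '--master k8s://https://<k8s-apiserver-host>:443 '
    '--deploy-mode cluster '
    '--name my-spark-app '
    '--class com.example.MyApp '
    '--conf spark.executor.instances=2 '
    '--conf spark.kubernetes.container.image=<your-spark-image> '
    'local:///opt/spark/jars/my-app.jar'
)

submit_job = BashOperator(task_id='spark_submit_job', bash_command=spark_submit_cmd, dag=dag)

And one possible pattern for the KubernetesPodOperator route: Airflow creates a pod that runs the same spark-submit from inside the cluster. The image and namespace are again assumptions, and in_cluster=True assumes Airflow itself runs inside the Kubernetes cluster:

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# The pod image is assumed to contain the Spark 2.4.4 distribution with spark-submit on PATH.
submit_in_pod = KubernetesPodOperator(
    task_id='spark_submit_pod',
    name='spark-submit-pod',
    namespace='default',
    image='<your-spark-image>',
    cmds=['spark-submit'],
    arguments=[
        '--master', 'k8s://https://<k8s-apiserver-host>:443',
        '--deploy-mode', 'cluster',
        '--name', 'my-spark-app',
        '--class', 'com.example.MyApp',
        '--conf', 'spark.kubernetes.container.image=<your-spark-image>',
        'local:///opt/spark/jars/my-app.jar',
    ],
    in_cluster=True,
    is_delete_operator_pod=True,
    get_logs=True,
    dag=dag,
)

In both sketches, the service account or kubeconfig that spark-submit uses must be allowed to create driver and executor pods in the target namespace.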


Note: for each of these operators you need to ensure that your Airflow environment contains all the dependencies required for execution, as well as the credentials configured to access the required services.

You can also refer to this existing thread:
