How to submit Spark jobs to EMR cluster from Airflow?

Problem description
How can I establish a connection between an EMR master cluster (created by Terraform) and Airflow? I have Airflow set up on an AWS EC2 server with the same SG, VPC and subnet.
I need a solution so that Airflow can talk to EMR and execute spark-submit.
The blog posts I found only cover execution after the connection has been established (they didn't help much).
In Airflow I have created connections for AWS and EMR using the UI:
Below is code that lists the EMR clusters that are active and terminated; I can also fine-tune it to get only the active clusters:
from airflow.contrib.hooks.aws_hook import AwsHook
import boto3

hook = AwsHook(aws_conn_id='aws_default')
client = hook.get_client_type('emr', 'eu-central-1')

# `a` was left undefined in the original snippet; presumably it is the
# cluster list returned by boto3's list_clusters()
a = client.list_clusters()['Clusters']
for x in a:
    print(x['Status']['State'], x['Name'])
My question is: how can I update my code above so that it can perform spark-submit actions?
While it may not directly address your particular query, broadly, here are some ways you can trigger spark-submit on (remote) EMR via Airflow:
Use Apache Livy (see the sketch after this list)
- This solution is actually independent of the remote server, i.e., EMR
- Here's an example
- The downside is that Livy is in early stages and its API appears incomplete and wonky to me
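For illustration, here is a minimal sketch of a Livy-based submission in plain Python (it could live inside a PythonOperator callable). The Livy host, S3 path and job arguments are placeholders, not values from your setup:

import json
import time

import requests

# Livy listens on port 8998 on the EMR master by default (placeholder host)
LIVY_URL = 'http://<emr-master-ip>:8998'

# POST /batches makes Livy run a spark-submit on the cluster
resp = requests.post(
    LIVY_URL + '/batches',
    data=json.dumps({
        'file': 's3://my-bucket/jobs/my_spark_job.py',  # hypothetical job file
        'args': ['--date', '2019-01-01'],
    }),
    headers={'Content-Type': 'application/json'},
)
batch_id = resp.json()['id']

# Poll GET /batches/{id}/state until the batch reaches a terminal state
state = 'starting'
while state not in ('success', 'dead', 'killed'):
    time.sleep(30)
    state = requests.get('%s/batches/%s/state' % (LIVY_URL, batch_id)).json()['state']
print('Batch %s finished in state %s' % (batch_id, state))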
Use EmrSteps API (see the sketch after this list)
- Dependent on the remote system: EMR
- Robust, but since it is inherently async, you will also need an EmrStepSensor (alongside EmrAddStepsOperator)
- On a single EMR cluster, you cannot have more than one step running simultaneously (although some hacky workarounds exist)
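As a rough sketch of that pairing, assuming Airflow 1.10's contrib operators, a placeholder cluster id and a hypothetical S3 job file (EMR runs spark-submit through its command-runner.jar):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

# An EMR step that wraps a spark-submit invocation
SPARK_STEPS = [{
    'Name': 'my_spark_step',
    'ActionOnFailure': 'CONTINUE',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': ['spark-submit', '--deploy-mode', 'cluster',
                 's3://my-bucket/jobs/my_spark_job.py'],  # hypothetical job file
    },
}]

dag = DAG('emr_spark_submit', start_date=datetime(2019, 1, 1), schedule_interval=None)

add_steps = EmrAddStepsOperator(
    task_id='add_steps',
    job_flow_id='j-XXXXXXXXXXXXX',  # placeholder: your cluster id (e.g. a Terraform output)
    aws_conn_id='aws_default',
    steps=SPARK_STEPS,
    dag=dag,
)

# Adding a step is async, so a sensor polls the step until it completes
watch_step = EmrStepSensor(
    task_id='watch_step',
    job_flow_id='j-XXXXXXXXXXXXX',
    step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag,
)

add_steps >> watch_step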
Use SSHHook / SSHOperator (see the sketch after this list)
- Again independent of remote system
- Comparatively easier to get started with
- If your spark-submit command involves a lot of arguments, building that command (programmatically) can become cumbersome
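A minimal sketch with SSHOperator, assuming an Airflow SSH connection (here called 'emr_master_ssh') that points at the EMR master node, the `dag` object from the previous sketch, and the same hypothetical job file:

from airflow.contrib.operators.ssh_operator import SSHOperator

# Runs spark-submit on the EMR master node over SSH
submit_over_ssh = SSHOperator(
    task_id='spark_submit_over_ssh',
    ssh_conn_id='emr_master_ssh',  # hypothetical connection to the master node
    command=(
        'spark-submit --master yarn --deploy-mode cluster '
        's3://my-bucket/jobs/my_spark_job.py'
    ),
    dag=dag,
)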
EDIT-1

There seems to be another straightforward way: specifying the remote master IP (see the sketch after this list)
- Independent of remote system
- Needs modifying Global Configurations / Environment Variables
- See @cricket_007's answer for details
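A sketch of that approach, assuming you have copied the cluster's Hadoop client configuration (core-site.xml, yarn-site.xml, ...) from the EMR master to a local directory on the Airflow box and have Spark installed there; the config path and job file are placeholders:

from airflow.operators.bash_operator import BashOperator

# spark-submit runs locally on the Airflow machine but targets EMR's YARN,
# which it locates through the copied configuration in HADOOP_CONF_DIR
submit_to_remote_yarn = BashOperator(
    task_id='spark_submit_to_emr_yarn',
    bash_command=(
        'export HADOOP_CONF_DIR=/home/airflow/emr-conf && '  # placeholder path
        'spark-submit --master yarn --deploy-mode cluster '
        's3://my-bucket/jobs/my_spark_job.py'
    ),
    dag=dag,
)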
Useful links
- This one is from @Kaxil Naik himself: Is there a way to submit spark job on different server running master
- Spark job submission using Airflow by submitting batch POST method on Livy and tracking job
- Remote spark-submit to YARN running on EMR