How to submit Spark jobs to EMR cluster from Airflow?

Question

How can I establish a connection between an EMR master cluster (created by Terraform) and Airflow? I have Airflow set up on an AWS EC2 server in the same SG, VPC and subnet.

I need a solution so that Airflow can talk to EMR and execute spark-submit.

https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/

These blogs cover execution after the connection has been established (they didn't help much).

In Airflow I have created connections for AWS and EMR using the UI.

Below is the code that lists the EMR clusters which are active and terminated; I can also fine-tune it to get only active clusters:

from airflow.contrib.hooks.aws_hook import AwsHook

hook = AwsHook(aws_conn_id='aws_default')
client = hook.get_client_type('emr', 'eu-central-1')

# list_clusters() returns active and terminated clusters; pass
# ClusterStates=['RUNNING', 'WAITING'] to keep only active ones
response = client.list_clusters()
for x in response['Clusters']:
    print(x['Status']['State'], x['Name'])

My question is: how can I update my code above to perform spark-submit actions?

Solution

While it may not directly address your particular query, broadly speaking, here are some ways you can trigger spark-submit on a (remote) EMR cluster via Airflow:

  1. Use Apache Livy

    • This solution is actually independent of the remote server, i.e., EMR
    • Here's an example; a minimal sketch also follows this list
    • The downside is that Livy is in early stages and its API appears incomplete and wonky to me
  2. Use EmrSteps API

    • Dependent on remote system: EMR
    • Robust, but since it is inherently async, you will also need an EmrStepSensor (alongside EmrAddStepsOperator); see the sketch after this list
    • On a single EMR cluster, you cannot have more than one step running simultaneously (although some hacky workarounds exist)
  3. Use SSHHook / SSHOperator

    • Again independent of remote system
    • Comparatively easier to get started with
    • If your spark-submit command involves a lot of arguments, building that command (programmatically) can become cumbersome; an SSHOperator sketch follows this list
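
For option 1, here is a minimal sketch of triggering a Spark batch through Livy's REST API from an Airflow task. It assumes Livy is running on the EMR master at its default port 8998; the master host name and the S3 application path are placeholders:

from airflow.operators.python_operator import PythonOperator
import requests

LIVY_URL = 'http://<emr-master-dns>:8998/batches'  # placeholder master DNS

def submit_via_livy():
    # Livy runs spark-submit on the cluster on our behalf
    response = requests.post(
        LIVY_URL,
        json={'file': 's3://my-bucket/my_job.py'},  # placeholder application
    )
    response.raise_for_status()
    # Livy returns a batch id; poll GET /batches/<id>/state for completion
    return response.json()['id']

livy_submit = PythonOperator(
    task_id='livy_submit',
    python_callable=submit_via_livy,
    dag=dag,  # assumes an existing DAG object
)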
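
For option 2, a sketch using the contrib EMR operators (Airflow 1.10-era import paths, matching the AwsHook import above); the cluster id and S3 path are placeholders:

from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

SPARK_STEPS = [{
    'Name': 'my_spark_job',
    'ActionOnFailure': 'CONTINUE',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',  # EMR's built-in launcher for spark-submit
        'Args': ['spark-submit', '--deploy-mode', 'cluster',
                 's3://my-bucket/my_job.py'],  # placeholder application
    },
}]

add_steps = EmrAddStepsOperator(
    task_id='add_steps',
    job_flow_id='j-XXXXXXXXXXXXX',  # placeholder cluster id
    aws_conn_id='aws_default',
    steps=SPARK_STEPS,
    dag=dag,  # assumes an existing DAG object
)

# The step runs asynchronously, so a sensor waits for it to finish
watch_step = EmrStepSensor(
    task_id='watch_step',
    job_flow_id='j-XXXXXXXXXXXXX',
    step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag,
)

add_steps >> watch_step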
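
For option 3, a sketch with SSHOperator; 'emr_master_ssh' is a hypothetical SSH connection to the EMR master that you would configure in the Airflow UI:

from airflow.contrib.operators.ssh_operator import SSHOperator

spark_submit_ssh = SSHOperator(
    task_id='spark_submit_over_ssh',
    ssh_conn_id='emr_master_ssh',  # hypothetical connection id
    # The whole spark-submit command is one string, which is what becomes
    # cumbersome when there are many arguments
    command='spark-submit --deploy-mode cluster s3://my-bucket/my_job.py',
    dag=dag,  # assumes an existing DAG object
)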


EDIT-1

There seems to be another straightforward way

  1. Specifying remote master-IP

    • Independent of remote system
    • Needs modifying Global Configurations / Environment Variables
    • See @cricket_007's answer for details; a BashOperator sketch follows this list
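
A sketch of this approach with a plain BashOperator, assuming Spark binaries are installed on the Airflow machine and that /opt/emr-conf (a placeholder path) holds the core-site.xml / yarn-site.xml copied from the EMR master, so that spark-submit talks to the remote YARN:

from airflow.operators.bash_operator import BashOperator

spark_submit_remote = BashOperator(
    task_id='spark_submit_remote_yarn',
    # HADOOP_CONF_DIR points spark-submit at the remote EMR YARN
    bash_command=(
        'HADOOP_CONF_DIR=/opt/emr-conf '
        'spark-submit --master yarn --deploy-mode cluster '
        's3://my-bucket/my_job.py'  # placeholder application
    ),
    dag=dag,  # assumes an existing DAG object
)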

