Airflow and Spark/Hadoop - A single cluster, or one for Airflow and another for Spark/Hadoop?


Problem description

I'm trying to figure out the best way to work with Airflow and Spark/Hadoop. I already have a Spark/Hadoop cluster, and I'm thinking about creating a separate cluster for Airflow that would submit jobs remotely to the Spark/Hadoop cluster.

Any advice on this? Deploying Spark remotely from another cluster looks a little complicated, and it would create some duplication of configuration files.

Recommended answer

I believe you really only need to configure a yarn-site.xml file in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)
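For example, a minimal sketch of what this could look like on the Airflow side is shown below. It assumes an Airflow 2.x installation, that spark-submit and a copy of the cluster's client configuration (notably yarn-site.xml) have been placed on the Airflow worker under /etc/hadoop/conf, and that /apps/etl_job.py is a hypothetical PySpark application; the Spark provider package's SparkSubmitOperator would be an alternative to a plain BashOperator:

```python
# Minimal sketch, assuming spark-submit and the Hadoop client configs
# (including yarn-site.xml) were copied to the Airflow worker under
# /etc/hadoop/conf, and /apps/etl_job.py is a hypothetical PySpark job.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="spark_submit_yarn_client",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_job = BashOperator(
        task_id="submit_spark_job",
        # Client deploy mode: the driver runs on the Airflow worker, while
        # the executors are scheduled by YARN on the Hadoop cluster.
        bash_command=(
            "export HADOOP_CONF_DIR=/etc/hadoop/conf && "
            "spark-submit --master yarn --deploy-mode client "
            "--num-executors 4 --executor-memory 2g "
            "/apps/etl_job.py"
        ),
    )
```

With client deploy mode the driver runs inside the Airflow task, so its output ends up in the Airflow task log, which is the trade-off the answer alludes to.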

Once an Application Master is deployed within YARN, Spark runs locally to the Hadoop cluster.

If you really want, you could also add an hdfs-site.xml and hive-site.xml to be submitted from Airflow (if that's possible), but otherwise at least the hdfs-site.xml file should be picked up from the YARN container classpath (not all NodeManagers will necessarily have a Hive client installed on them).
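If you do need to ship those files with the job, one way is to pass them with spark-submit's --files option. The sketch below follows the same assumptions as the one above; the paths under /etc/hadoop/conf and /etc/hive/conf are placeholders for wherever the client configs were copied on the Airflow machine:

```python
# Variant of the earlier sketch: ship copies of hdfs-site.xml and hive-site.xml
# from the Airflow machine along with the job via --files, in case the YARN
# container classpath does not already provide them.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="spark_submit_with_site_configs",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_with_configs = BashOperator(
        task_id="submit_spark_job_with_configs",
        bash_command=(
            "export HADOOP_CONF_DIR=/etc/hadoop/conf && "
            "spark-submit --master yarn --deploy-mode client "
            "--files /etc/hadoop/conf/hdfs-site.xml,/etc/hive/conf/hive-site.xml "
            "/apps/etl_job.py"
        ),
    )
```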
