Airflow and Spark/Hadoop - a single cluster, or one for Airflow and another for Spark/Hadoop
Question
I'm trying to figure out the best way to work with Airflow and Spark/Hadoop. I already have a Spark/Hadoop cluster, and I'm thinking about creating a separate cluster for Airflow that will submit jobs remotely to the Spark/Hadoop cluster.
Any advice about it? It looks a little complicated to deploy Spark remotely from another cluster, and it would create some duplicated configuration files.
Answer
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)
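As a minimal sketch of what that looks like on the Airflow side (the config directory and application path are placeholders, not from the original answer): the machine running Airflow needs a Hadoop client configuration that points at the remote cluster before spark-submit in client mode will find it.

```shell
# Point the Spark client at a Hadoop config directory that contains the
# yarn-site.xml for the REMOTE cluster. The yarn.resourcemanager.hostname
# (or yarn.resourcemanager.address) property in that file is what tells
# spark-submit where to send the job. Paths here are placeholders.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf

# Client deploy mode: the driver runs locally on the Airflow machine,
# while the executors run in YARN containers on the Hadoop cluster.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --name airflow_submitted_job \
  /path/to/your_app.py
```

An Airflow task can then run this command through a BashOperator, or through the Spark provider's SparkSubmitOperator, which wraps the same invocation.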
Once an Application Master is deployed within YARN, Spark runs locally to the Hadoop cluster.
If you really want, you could add an hdfs-site.xml and hive-site.xml to be submitted from Airflow as well (if that's possible), but otherwise at least the hdfs-site.xml files should be picked up from the YARN container classpath (not all NodeManagers may have a Hive client installed on them).