Is there a way to submit a Spark job on a different server running the master?

Question

We have a requirement to schedule Spark jobs, and since we are familiar with Apache Airflow we want to go ahead with it to create different workflows. I searched the web but did not find a step-by-step guide to scheduling Spark jobs on Airflow, or an option to run them on a different server running the master.

Answer to this will be highly appreciated. Thanks in advance.

Answer

There are 3 ways you can submit Spark jobs using Apache Airflow remotely:

(1) Using SparkSubmitOperator: This operator expects that you have a spark-submit binary and the YARN client config set up on your Airflow server. It invokes the spark-submit command with the given options, blocks until the job finishes, and returns the final status. The good thing is that it also streams the logs from the spark-submit command's stdout and stderr.

You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work.

Once an Application Master is deployed within YARN, Spark runs local to the Hadoop cluster.

If you really want, you could also add an hdfs-site.xml and hive-site.xml to be submitted from Airflow (if that's possible), but otherwise at least the hdfs-site.xml file should be picked up from the YARN container classpath.
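
For reference, a minimal DAG sketch using SparkSubmitOperator might look like the following. The import path assumes Airflow 2 with the apache-airflow-providers-apache-spark package installed, and the connection ID, application path, and job name are placeholders you would replace for your own environment:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Minimal sketch: assumes spark-submit and the YARN client config are
# available on the Airflow server, as described above.
with DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_spark_job",
        application="/path/to/your_app.py",         # placeholder: your Spark application
        conn_id="spark_default",                    # Airflow connection pointing at YARN
        conf={"spark.submit.deployMode": "client"},
        name="airflow-spark-job",
    )

Because the operator blocks until spark-submit returns, the Airflow task state mirrors the final status of the Spark job.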

(2) Using SSHOperator: Use this operator to run bash commands such as spark-submit on a remote server (over the SSH protocol via the paramiko library). The benefit of this approach is that you don't need to copy hdfs-site.xml or maintain any files on the Airflow server.
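
As an illustration, a hedged sketch of this approach with SSHOperator could look like the following. It assumes the apache-airflow-providers-ssh package is installed and an SSH connection to the cluster edge node is configured in Airflow; the connection ID and the application path are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

# Minimal sketch: spark-submit runs on the remote host reached over SSH,
# so no Hadoop client configs need to live on the Airflow server itself.
with DAG(
    dag_id="spark_submit_over_ssh_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark_job = SSHOperator(
        task_id="run_spark_job",
        ssh_conn_id="spark_edge_node",   # placeholder: SSH connection to the cluster edge node
        command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "/path/to/your_app.py"       # placeholder: application path on the remote host
        ),
    )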

(3) Using SimpleHTTPOperator with Livy: Livy is an open-source REST interface for interacting with Apache Spark from anywhere. You just need to make REST calls.
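
For example, a rough sketch of a SimpleHttpOperator task posting a batch job to Livy's /batches endpoint might look like the following. It assumes the apache-airflow-providers-http package is installed and an Airflow HTTP connection points at the Livy server (typically on port 8998); the connection ID, file location, and class name are placeholders:

import json
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator

# Minimal sketch: submits a Spark batch through Livy's REST API, so nothing
# Spark- or Hadoop-related has to be installed on the Airflow server.
with DAG(
    dag_id="spark_via_livy_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_batch = SimpleHttpOperator(
        task_id="submit_livy_batch",
        http_conn_id="livy_http",                   # placeholder: connection to the Livy server
        endpoint="batches",                         # Livy batch submission endpoint
        method="POST",
        headers={"Content-Type": "application/json"},
        data=json.dumps({
            "file": "hdfs:///apps/your_app.jar",    # placeholder: application location
            "className": "com.example.YourApp",     # placeholder: main class
        }),
        response_check=lambda response: response.status_code in (200, 201),
    )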

I personally prefer SSHOperator :)
