Is there a way to submit a Spark job on a different server running the master?

Problem description

We have a requirement to schedule Spark jobs. Since we are familiar with Apache Airflow, we want to go ahead with it to create different workflows. I searched the web but did not find a step-by-step guide to scheduling Spark jobs on Airflow, or an option to run them on a different server from the one running the master.

An answer to this will be highly appreciated. Thanks in advance.

Answer

There are three ways you can submit Spark jobs remotely using Apache Airflow:

(1) Using SparkSubmitOperator: this operator expects you to have a spark-submit binary and YARN client configuration set up on your Airflow server. It invokes the spark-submit command with the given options, blocks until the job finishes, and returns the final status. The nice thing is that it also streams the logs from the spark-submit command's stdout and stderr.

You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work.

Once an Application Master is deployed within YARN, Spark runs local to the Hadoop cluster.

If you really want, you could add an hdfs-site.xml and a hive-site.xml to be submitted from Airflow as well (if that's possible), but otherwise at least the hdfs-site.xml file should be picked up from the YARN container classpath. A minimal DAG sketch for this approach is shown below.
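A minimal sketch of option (1), assuming an Airflow 2.x environment with the apache-airflow-providers-apache-spark package installed; the DAG id, connection id, and application path are illustrative placeholders, not part of the original answer:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_submit_example",           # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_spark_job",
        conn_id="spark_default",              # Airflow Spark connection pointing at the cluster
        application="/path/to/your_app.py",   # placeholder path to the Spark application
        name="airflow_spark_job",
        conf={"spark.submit.deployMode": "client"},  # relies on yarn-site.xml on the Airflow host
        verbose=True,                         # stream spark-submit stdout/stderr into the task log
    )

The task runs spark-submit on the Airflow worker itself, so that worker needs the Spark binaries and YARN client config mentioned above.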

(2) Using SSHOperator: use this operator to run bash commands on a remote server (over SSH, via the paramiko library), such as spark-submit. The benefit of this approach is that you don't need to copy the hdfs-site.xml or maintain any files on the Airflow server.
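A minimal sketch of option (2), assuming the apache-airflow-providers-ssh package; the SSH connection id and the remote application path are assumptions for illustration:

from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="spark_submit_over_ssh",           # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    remote_submit = SSHOperator(
        task_id="spark_submit_remote",
        ssh_conn_id="ssh_spark_edge_node",    # Airflow SSH connection to the server that has spark-submit
        command=(
            "spark-submit --master yarn --deploy-mode client "
            "/path/to/your_app.py"            # placeholder path on the remote server
        ),
    )

Here the spark-submit binary and all Hadoop/YARN config live on the remote edge node, so the Airflow machine stays free of cluster configuration.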

(3) Using SimpleHTTPOperator with Livy: Livy is an open-source REST interface for interacting with Apache Spark from anywhere. You just need to make REST calls.
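A minimal sketch of option (3), posting a batch to Livy's /batches endpoint with SimpleHttpOperator (the class name in the apache-airflow-providers-http package); the HTTP connection id, jar path, and main class are illustrative assumptions:

import json
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(
    dag_id="spark_submit_via_livy",           # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    livy_batch = SimpleHttpOperator(
        task_id="submit_livy_batch",
        http_conn_id="livy_http",             # Airflow HTTP connection pointing at the Livy server (default port 8998)
        endpoint="batches",                   # Livy's batch-submission endpoint
        method="POST",
        headers={"Content-Type": "application/json"},
        data=json.dumps(
            {
                "file": "hdfs:///apps/your_app.jar",  # placeholder path to the application
                "className": "com.example.YourApp",   # placeholder main class
                "args": ["arg1"],
            }
        ),
        response_check=lambda response: response.status_code in (200, 201),
    )

Note that this only submits the batch; polling Livy for the job's final state would take an additional call (for example with an HTTP sensor).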

I personally prefer SSHOperator :)
