Spark-submit / spark-shell > difference between yarn-client and yarn-cluster mode

Question

I am running Spark with YARN.

From the link: http://spark.apache.org/docs/latest/running-on-yarn.html

I found an explanation of the different YARN modes, i.e. the values of the --master option with which Spark can run:

"There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN."

From this, I understand only that the difference is where the driver runs; I do not understand which mode runs faster. Moreover:

  • When running spark-submit, the --master option can select either client or cluster mode
  • Correspondingly, spark-shell's --master option can be yarn-client, but it does not support cluster mode

So I do not know how to choose: when to use spark-shell, when to use spark-submit, and especially when to use client mode versus cluster mode.

Recommended answer

spark-shell should be used for interactive queries. It needs to be run in yarn-client mode so that the machine you're running on acts as the driver.

With spark-submit, you submit a job to the cluster and its tasks run on the cluster. Normally you would run in cluster mode so that YARN can place the driver on a suitable node with available resources.
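As a rough illustration of the two cases above, here is a small Python sketch that only assembles the command lines and does not launch Spark. It assumes the newer spelling of the modes, `--master yarn` combined with `--deploy-mode client` or `--deploy-mode cluster`, rather than the `yarn-client` / `yarn-cluster` master values used in the question. The helper name `build_spark_command` and the file name `my_job.py` are hypothetical.

```python
def build_spark_command(tool, deploy_mode, app=None):
    """Return the argument list for a spark-shell or spark-submit run on YARN.

    This helper is illustrative only: it builds the command, it does not run it.
    """
    if tool not in ("spark-shell", "spark-submit"):
        raise ValueError("tool must be 'spark-shell' or 'spark-submit'")
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    # The interactive shell needs the driver on the local machine,
    # so cluster mode is rejected for it.
    if tool == "spark-shell" and deploy_mode == "cluster":
        raise ValueError("spark-shell only makes sense in client mode")

    args = [tool, "--master", "yarn", "--deploy-mode", deploy_mode]
    if tool == "spark-submit":
        if app is None:
            raise ValueError("spark-submit needs an application to run")
        args.append(app)  # hypothetical application file
    return args


if __name__ == "__main__":
    # Interactive work: the driver is the machine you are typing on.
    print(" ".join(build_spark_command("spark-shell", "client")))
    # Batch job: YARN picks a node in the cluster to host the driver.
    print(" ".join(build_spark_command("spark-submit", "cluster", "my_job.py")))
```

spark-submit can also be run in client mode, for example while debugging a job from an edge node; the sketch allows that combination and only rejects cluster mode for the shell.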

Some commands (like .collect()) send all the data to the driver node, which can cause a significant performance difference depending on whether the driver node is inside the cluster or on a machine outside it (e.g. a user's laptop).
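To get a feel for why the driver's location matters, here is a back-of-the-envelope Python sketch that estimates how long it might take just to move a collected result to the driver. All numbers are made up for illustration (a 2000 MB result, a 10,000 Mbit/s link inside the cluster, a 100 Mbit/s link to a laptop), and the estimate ignores serialization, latency, and driver memory limits.

```python
def transfer_seconds(result_megabytes, link_megabits_per_second):
    """Estimate the time to ship a result of the given size over a link.

    Deliberately naive: size in megabits divided by link speed.
    """
    result_megabits = result_megabytes * 8
    return result_megabits / link_megabits_per_second


if __name__ == "__main__":
    result_mb = 2000  # hypothetical size of the data returned by collect()

    # Assumed link speeds, not measurements.
    in_cluster = transfer_seconds(result_mb, 10_000)
    on_laptop = transfer_seconds(result_mb, 100)

    print(f"driver inside the cluster: ~{in_cluster:.1f} s")
    print(f"driver on a laptop:        ~{on_laptop:.1f} s")
```

Under these assumed numbers the same collect() differs by a factor of 100 in transfer time alone, which is the kind of gap the answer is pointing at.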
