Apache Spark: Differences between client and cluster deploy modes

Question

TL;DR: In a Spark Standalone cluster, what are the differences between client and cluster deploy modes? How do I set which mode my application is going to run in?

We have a Spark Standalone cluster with three machines, all of them with Spark 1.6.1:

  • A master machine, which is also where our application is run using spark-submit
  • Two identical worker machines

From the Spark documentation, I read:

(...) For standalone clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.

However, I don't really understand the practical differences from reading this, and I don't see what the advantages and disadvantages of the different deploy modes are.

Additionally, when I start my application using spark-submit, even if I set the property spark.submit.deployMode to "cluster", the Spark UI for my context shows the following entry:

So I am not able to test both modes to see the practical differences. That being said, my questions are:

1) What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?

2) How do I choose which one my application is going to run in, using spark-submit?

Recommended answer

What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?

Let's try to look at the differences between client and cluster mode.

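Client:

  • The driver runs in its own dedicated process on the machine that submits the application (in this setup, the Master node). This means it has that machine's available resources at its disposal to execute work.
  • The driver opens a dedicated Netty HTTP server and distributes the specified JAR files to all Worker nodes (a big advantage).
  • Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources on the driver program.
  • If the driver process dies, you need an external monitoring system to restart it.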

Cluster:

  • The driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
  • The driver runs as a dedicated, standalone process inside the Worker.
  • The driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
  • The driver program can be monitored from the Master node using the --supervise flag and restarted in case it dies.
  • When working in cluster mode, all JARs related to the execution of your application need to be available to all the workers. This means you can either place them in a shared location or manually copy them to a folder on each of the workers.
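
To make the cluster-mode points above concrete, here is a minimal sketch of a submission that asks the Master to supervise the driver and sets the driver's core and memory allocation explicitly. The master URL, main class, JAR path, and resource values are hypothetical placeholders, not values from the question. The script only prints the command so it can be inspected first; drop the leading `echo` to actually submit.

```shell
#!/bin/sh
# Hypothetical placeholders: substitute your own master URL, class, and JAR.
MASTER_URL="spark://master-host:7077"
MAIN_CLASS="com.example.MyApp"
# Assumed to be a location that every worker can read.
APP_JAR="/shared/jars/my-app.jar"

# Print (not run) a cluster-mode submission.
# --supervise asks the Master to restart the driver if it dies;
# --driver-cores and --driver-memory size the driver on its Worker.
cluster_submit_cmd() {
  echo ./bin/spark-submit \
    --class "$MAIN_CLASS" \
    --master "$MASTER_URL" \
    --deploy-mode cluster \
    --supervise \
    --driver-cores 1 \
    --driver-memory 1g \
    "$APP_JAR"
}

cluster_submit_cmd
```

The values 1 core and 1g are only illustrative; pick whatever the driver actually needs, keeping in mind that it is taken from a worker's capacity.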

Which one is better? That is for you to experiment and decide. There is no universally better choice here: each mode gains you something, and it's up to you to see which one works better for your use case.

How do I choose which one my application is going to run in, using spark-submit?

The way to choose which mode to run in is by using the --deploy-mode flag. From the Spark Configuration page:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
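
As an illustration of the template, the following sketch fills it in for both deploy modes. The master URL, main class, and JAR path are hypothetical placeholders and would need to be replaced with real values. The function only prints each command rather than running it; remove the leading `echo` to submit for real.

```shell
#!/bin/sh
# Hypothetical placeholders: substitute your own master URL, class, and JAR.
MASTER_URL="spark://master-host:7077"
MAIN_CLASS="com.example.MyApp"
APP_JAR="/path/to/my-app.jar"

# Print (not run) the spark-submit command for the given deploy mode.
# $1 is the deploy mode: "client" or "cluster".
submit_cmd() {
  echo ./bin/spark-submit \
    --class "$MAIN_CLASS" \
    --master "$MASTER_URL" \
    --deploy-mode "$1" \
    "$APP_JAR"
}

submit_cmd client   # driver runs inside the spark-submit process
submit_cmd cluster  # driver is launched on one of the Workers
```

The only difference between the two commands is the value passed to --deploy-mode, which is the point of the flag: the application itself does not need to change.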
