Apache Spark: Differences between client and cluster deploy modes


Question

TL;DR: In a Spark Standalone cluster, what are the differences between client and cluster deploy modes? How do I set which mode my application runs in?


We have a Spark Standalone cluster with three machines, all of them with Spark 1.6.1:

  • A master machine, which is also where we run our application using spark-submit
  • 2 identical worker machines

From the Spark Documentation, I read:

(...) For standalone clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.

However, I don't really understand the practical differences by reading this, and I don't get what are the advantages and disadvantages of the different deploy modes.

Additionally, when I start my application using spark-submit, even if I set the property spark.submit.deployMode to "cluster", the Spark UI for my context shows the following entry:

So I am not able to test both modes to see the practical differences. That being said, my questions are:

1) What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?

2) How do I choose which one my application is going to run in, using spark-submit?

Answer

What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?

Let's try to look at the differences between client and cluster mode.

Client:

  • The driver runs inside the spark-submit client process, on the machine the application is submitted from (in this setup, the Master machine). It can therefore use that machine's resources to do its work.
  • The driver opens a dedicated Netty HTTP server and distributes the specified JAR files to all Worker nodes (a big advantage).
  • Because that machine has resources of its own, you don't need to "spend" worker resources on the driver program.
  • If the driver process dies, you need an external monitoring system to restart it.

Cluster:

  • The driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
  • The driver runs as a dedicated, standalone process inside the Worker.
  • The driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
  • The driver program can be monitored from the Master node using the --supervise flag and restarted if it dies.
  • When working in cluster mode, all JARs related to the execution of your application need to be available to all the workers. This means you can either place them manually in a shared location or copy them into a folder on each worker.
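
The cluster-mode points above map onto spark-submit options roughly as follows. This is a minimal sketch rather than a tested deployment: the master URL, main class, and JAR path are hypothetical placeholders, the resource values are only illustrative, and the script prints the command (a dry run) instead of executing it. The options --supervise, --driver-cores, and --driver-memory are standard spark-submit options.

```shell
#!/usr/bin/env bash
# Dry run: assemble a cluster-mode spark-submit command and print it.
# Host name, class name, and JAR path below are hypothetical placeholders.

MASTER_URL="spark://master-host:7077"     # assumed standalone master URL
MAIN_CLASS="com.example.MyApp"            # assumed main class
APP_JAR="/opt/spark-apps/my-app.jar"      # assumed to exist at this path on every worker

cluster_submit_cmd() {
  local cmd=(
    ./bin/spark-submit
    --class "$MAIN_CLASS"
    --master "$MASTER_URL"
    --deploy-mode cluster     # driver is launched on one of the workers
    --supervise               # restart the driver if it exits with a non-zero code
    --driver-cores 1          # cores taken from the worker hosting the driver
    --driver-memory 1g        # memory taken from the worker hosting the driver
    "$APP_JAR"
  )
  # Print instead of executing so the sketch is safe to run anywhere.
  echo "${cmd[@]}"
}

cluster_submit_cmd
```

To actually submit, the printed command would be run from the Spark installation directory, with the placeholders replaced by real values for your cluster.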

Which one is better? That is for you to experiment and decide. There is no universally better choice here: each mode has its advantages, and it is up to you to see which one works better for your use case.

How do I choose which one my application is going to run in, using spark-submit?

You choose the mode with the --deploy-mode flag. From the Spark "Submitting Applications" documentation page:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
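
As a hedged illustration of filling in that template, the sketch below wraps the choice of mode in a small function. The function name submit_cmd, the master URL, the main class, and the JAR path are all hypothetical; only --deploy-mode and its two values, client and cluster, come from spark-submit itself. When --deploy-mode is omitted, spark-submit defaults to client mode, and the wrapper mirrors that. The command is printed rather than run.

```shell
#!/usr/bin/env bash
# Dry run: print a spark-submit command for the requested deploy mode.
# submit_cmd, the master URL, class name, and JAR path are hypothetical.

submit_cmd() {
  local mode="${1:-client}"   # mirror spark-submit's default of client mode
  case "$mode" in
    client|cluster) ;;
    *)
      echo "unknown deploy mode: $mode (expected client or cluster)" >&2
      return 1
      ;;
  esac
  echo ./bin/spark-submit \
    --class com.example.MyApp \
    --master spark://master-host:7077 \
    --deploy-mode "$mode" \
    /opt/spark-apps/my-app.jar
}

submit_cmd client    # driver runs inside the spark-submit process
submit_cmd cluster   # driver is launched on one of the workers
```

Passing the mode on the command line this way keeps the decision at submission time, which is when spark-submit decides where to launch the driver.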
