Spark yarn cluster vs client - how to choose which one to use?

Question

The Spark docs have the following paragraph that describes the difference between yarn client and yarn cluster:

There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

I'm assuming there are two choices for a reason. If so, how do you choose which one to use?

Please use facts to justify your response so that this question and answer(s) meet stackoverflow's requirements.

There are a few similar questions on stackoverflow; however, those questions focus on the difference between the two approaches rather than on when one approach is more suitable than the other.

Recommended answer

"A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).

Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters. Currently only YARN supports cluster mode for Python applications." -- Submitting Applications

What I understand from this is that both strategies use the cluster to distribute tasks; the difference is where the "driver program" runs: locally, inside the spark-submit process, or in the cluster alongside the executors.
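As a minimal sketch of how that choice is expressed at submit time: spark-submit takes `--master yarn` together with `--deploy-mode client` or `--deploy-mode cluster`. The helper name `spark_submit_command`, the jar name and the class name below are hypothetical and only serve the illustration; the function just builds the argument list and does not launch anything.

```python
import shlex


def spark_submit_command(app_path, deploy_mode, main_class=None, app_args=()):
    """Build a spark-submit argument list for YARN (illustrative helper, not part of Spark)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    # --master and --deploy-mode are standard spark-submit options.
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    if main_class is not None:
        # --class is only needed for JVM applications packaged as a jar.
        cmd += ["--class", main_class]
    cmd.append(app_path)
    cmd.extend(app_args)
    return cmd


if __name__ == "__main__":
    # Hypothetical jar and class names, for illustration only.
    for mode in ("client", "cluster"):
        args = spark_submit_command("my-app.jar", mode, main_class="com.example.MyApp")
        print(shlex.join(args))
```

The two resulting commands differ only in the value passed to `--deploy-mode`; everything else about how tasks are distributed to executors stays the same.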

When you should use either of them is detailed in the quote above, but I also did another thing: for big jars, I used rsync to copy them to the cluster (or even to the master node), which was about 100 times faster over the network, and then submitted from the cluster. This can be better than "cluster mode" for big jars. Note that client mode probably does not transfer the jar to the master. At that point the difference between the two is minimal. Client mode is probably better when the driver program is idle most of the time, to make full use of the cores on the local machine and perhaps avoid transferring the jar to the master (even on the loopback interface a big jar takes quite a few seconds). And with client mode you can transfer (rsync) the jar to any cluster node.

On the other hand, if the driver is very intensive in CPU or I/O, cluster mode may be more appropriate, to better balance the cluster. In client mode, the local machine would run both the driver and as many workers as possible, which overloads it and slows down its local tasks, so the whole job may end up waiting for a couple of tasks from the local machine.

To sum up, if I am in the same local network as the cluster, I would use client mode and submit from my laptop. If the cluster is far away, I would either submit locally with cluster mode, or rsync the jar to the remote cluster and submit it there, in client or cluster mode, depending on how heavy the driver program is on resources.
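The rule of thumb above can be written down as a small decision function. This is only a sketch of one reading of the answer, not an official Spark recommendation: the function name `choose_deploy_mode` and its three boolean parameters are hypothetical, and "near" means the submitting machine is on the same local network as the cluster (which includes a cluster node you have rsynced the jar to).

```python
def choose_deploy_mode(submitter_near_cluster, interactive=False, driver_heavy=False):
    """Return 'client' or 'cluster' following the heuristic described in the answer.

    All three parameters are illustrative assumptions:
      submitter_near_cluster: submitting machine shares a local network with the workers
      interactive: the application needs console I/O (e.g. a REPL such as the Spark shell)
      driver_heavy: the driver itself is CPU- or I/O-intensive
    """
    if interactive:
        # The driver's input and output must stay attached to the local console.
        return "client"
    if not submitter_near_cluster:
        # Keep the driver close to the executors to limit network latency.
        return "cluster"
    # Near the cluster: let YARN place a heavy driver, keep a light one local.
    return "cluster" if driver_heavy else "client"


if __name__ == "__main__":
    print(choose_deploy_mode(submitter_near_cluster=True))
    print(choose_deploy_mode(submitter_near_cluster=False))
    print(choose_deploy_mode(submitter_near_cluster=True, driver_heavy=True))
```

Treating the interactive case first reflects the quoted documentation, which singles out REPL-style applications as the main reason to stay in client mode.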

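AFAIK, with the driver program running in the cluster (cluster mode), it is less vulnerable to remote disconnects crashing the driver and the entire Spark job. This is especially useful for long-running jobs such as stream-processing workloads.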
