Spark yarn cluster vs client - how to choose which one to use?

Question

The Spark docs have the following paragraph that describes the difference between YARN client mode and YARN cluster mode:

There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

I'm assuming there are two choices for a reason. If so, how do you choose which one to use?

Please use facts to justify your response so that this question and answer(s) meet stackoverflow's requirements.

There are a few similar questions on stackoverflow; however, they focus on the difference between the two approaches, not on when one approach is more suitable than the other.

Recommended answer

"A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).

Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters. Currently only YARN supports cluster mode for Python applications." -- Submitting Applications

What I understand from this is that both strategies use the cluster to distribute tasks; the difference is where the "driver program" runs: locally, inside the spark-submit process, or in the cluster alongside the executors.
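
As a rough illustration of how small the difference is at submission time, the sketch below builds the spark-submit argument list for each mode. Only the `--deploy-mode` value changes. The jar name `my-app.jar` and the class `com.example.MyApp` are hypothetical placeholders, and `--master yarn` is assumed because the question is about YARN.

```python
def build_submit_command(deploy_mode, app_jar, main_class):
    """Return a spark-submit argument list for a YARN deployment."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")
    return [
        "spark-submit",
        "--master", "yarn",            # assumption: the cluster manager is YARN
        "--deploy-mode", deploy_mode,  # "client": driver runs in this process
                                       # "cluster": driver runs in the YARN application master
        "--class", main_class,
        app_jar,
    ]


# Hypothetical jar and class names, for illustration only.
for mode in ("client", "cluster"):
    print(" ".join(build_submit_command(mode, "my-app.jar", "com.example.MyApp")))
```

The printed commands can be pasted into a shell on a machine that has a configured Spark client; the function itself does not launch anything.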

When you should use either of them is detailed in the quote above, but I also did another thing: for big jars, I used rsync to copy them to the cluster (or even to the master node) at 100 times the network speed, and then submitted from the cluster. This can be better than "cluster mode" for big jars. Note that client mode probably does not transfer the jar to the master. At that point the difference between the two is minimal. Client mode is probably better when the driver program is idle most of the time, to make full use of the cores on the local machine and perhaps avoid transferring the jar to the master (even on the loopback interface, a big jar takes quite a few seconds). And with client mode you can transfer (rsync) the jar to any cluster node.
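
The rsync-then-submit workflow described above could be sketched roughly as follows. This only builds the two commands and does not run them. The host `user@gateway.example.com`, the directory `/tmp/jars`, and the jar and class names are hypothetical, and `--master yarn` is an assumption carried over from the question rather than something the answer states.

```python
import posixpath


def plan_rsync_then_submit(local_jar, gateway, remote_dir, main_class,
                           deploy_mode="client"):
    """Return (copy_cmd, submit_cmd) for copying a jar to a cluster node
    and submitting it from there."""
    remote_jar = posixpath.join(remote_dir, posixpath.basename(local_jar))
    # Step 1: copy the big jar to a machine inside the cluster's network.
    copy_cmd = ["rsync", "-a", local_jar, f"{gateway}:{remote_jar}"]
    # Step 2: run spark-submit on that machine, so the driver (in client
    # mode) sits next to the executors instead of on the laptop.
    submit_cmd = [
        "ssh", gateway,
        "spark-submit", "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        remote_jar,
    ]
    return copy_cmd, submit_cmd


# Hypothetical names, for illustration only.
copy_cmd, submit_cmd = plan_rsync_then_submit(
    "target/my-app.jar", "user@gateway.example.com", "/tmp/jars",
    "com.example.MyApp")
print(" ".join(copy_cmd))
print(" ".join(submit_cmd))
```

Returning the commands as lists keeps the sketch free of side effects; each list could be passed to `subprocess.run` once the placeholder names are replaced with real ones.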

On the other hand, if the driver is very CPU- or I/O-intensive, cluster mode may be more appropriate, because it balances the load across the cluster better. In client mode, the local machine would run both the driver and as many workers as possible, which overloads it and slows down its local tasks, so the whole job may end up waiting for a couple of tasks from the local machine.

To sum up, if I am in the same local network as the cluster, I would use client mode and submit from my laptop. If the cluster is far away, I would either submit locally in cluster mode, or rsync the jar to the remote cluster and submit it there, in client or cluster mode, depending on how heavy the driver program is on resources.
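
The rule of thumb in this summary can be written down as a small decision helper. This is only one reading of the answer, not an official Spark recommendation: the function name and its three flags are made up for illustration, and using jar size to pick between the two "far away" options is an assumption based on the rsync remark earlier in the answer.

```python
def choose_submission(near_cluster, driver_heavy=False, big_jar=False):
    """Return (where_to_submit, deploy_mode) following the summary above."""
    if near_cluster:
        # Same local network: the driver can stay on the laptop.
        return ("laptop", "client")
    if big_jar:
        # Far away with a big jar: rsync it first, then submit on the
        # cluster; pick the mode by how resource-hungry the driver is.
        mode = "cluster" if driver_heavy else "client"
        return ("cluster node (after rsync)", mode)
    # Far away: submit locally and let the driver run inside the cluster.
    return ("laptop", "cluster")


for args in [(True, False, False), (False, False, False),
             (False, False, True), (False, True, True)]:
    print(args, "->", choose_submission(*args))
```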

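AFAIK, with the driver program running in the cluster (cluster mode), it is less vulnerable to remote disconnects that would crash the driver and the entire Spark job. This is especially useful for long-running jobs such as stream-processing workloads.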
