Spark: launch jobs from a single JVM with different memory/cores configs simultaneously

Problem description

Suppose you have a Spark cluster with the Standalone manager, where jobs are scheduled through a SparkSession created in the client application. The client application runs on a JVM. You have to launch each job with a different config for the sake of performance; see the Job types example below.

The problem is that you cannot create two sessions from a single JVM.

So how would you launch several Spark jobs with different session configs simultaneously?
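A minimal sketch of the blocker, assuming a Standalone master URL and placeholder memory values: a second builder call in the same JVM simply hands back the session that already exists, so executor-level settings in the second config are ignored.

```scala
import org.apache.spark.sql.SparkSession

// The first session fixes the executor layout for this JVM.
val heavy = SparkSession.builder()
  .master("spark://master-host:7077")      // assumed Standalone master URL
  .appName("heavy-job")                    // hypothetical app name
  .config("spark.executor.memory", "16g")
  .getOrCreate()

// A second, differently sized "session" in the same JVM: getOrCreate()
// returns the existing session, and the new executor memory setting
// does not affect the already-running SparkContext.
val light = SparkSession.builder()
  .config("spark.executor.memory", "1g")
  .getOrCreate()

println(heavy eq light)   // true - same session, same executors
```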

By different session configs I mean (a short builder sketch follows this list):

  • spark.executor.cores
  • spark.executor.memory
  • spark.kryoserializer.buffer.max
  • spark.scheduler.pool
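For reference, a sketch of where these configs usually live, with placeholder values: the first three are fixed when the SparkSession (and its SparkContext) is created, while spark.scheduler.pool is a per-thread property that can be switched between jobs.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values; executor cores/memory and the Kryo buffer are
// set once, at session creation time.
val spark = SparkSession.builder()
  .master("spark://master-host:7077")                 // assumed master URL
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "8g")
  .config("spark.kryoserializer.buffer.max", "512m")
  .getOrCreate()

// spark.scheduler.pool is different: it is a thread-local property,
// so it can vary per job even within one session.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "fast_pool")  // hypothetical pool name
```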

Possible ways to solve the problem:

  1. Set different session configs for each Spark job within the same SparkSession. Is that possible? (See the fair-scheduler sketch after this list.)
  2. Launch another JVM just to start another SparkSession, something I could call a Spark session service. But you never know how many jobs with different configs you are going to launch simultaneously in the future. At the moment I need only 2-3 different configs at a time. It may be enough, but it is not flexible.
  3. Create a global session with the same configs for all kinds of jobs. But this approach is the worst from a performance perspective.
  4. Use Spark only for heavy jobs, and run all quick search tasks outside Spark. But that's a mess, since you need to keep another solution (like Hazelcast) in parallel with Spark and split resources between them. Moreover, that brings extra complexity for everyone: deployment, support, etc.
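Regarding option 1, a minimal sketch of what can and cannot vary inside one SparkSession, assuming FAIR scheduling is enabled and a hypothetical fairscheduler.xml defines the pools: the scheduler pool can be changed per thread, but executor memory and cores cannot.

```scala
import org.apache.spark.sql.SparkSession

// Single shared session with FAIR scheduling; the allocation file path
// and pool names are assumptions.
val spark = SparkSession.builder()
  .master("spark://master-host:7077")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")
  .getOrCreate()

val sc = spark.sparkContext

// Jobs submitted from this thread land in the hypothetical "fast_pool";
// a heavy job can run from another thread in a different pool.
sc.setLocalProperty("spark.scheduler.pool", "fast_pool")
val quickCount = spark.read.parquet("/data/lookup").count()   // hypothetical path
sc.setLocalProperty("spark.scheduler.pool", null)             // back to the default pool

// Executor memory/cores stay whatever the session was created with;
// they cannot be changed per job inside one session.
```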

Job types example

  1. Dump-a-huge-database task. It is a long-running, CPU-light but IO-intensive task, so you may want to launch as many executors as possible with low memory and few cores per executor (per-type config sketches follow this list).
  2. Heavy task that processes the results of the dump. It is CPU-intensive, so you would launch one executor per cluster machine with the maximum CPUs and cores.
  3. Quick data-retrieval task, which requires one executor per machine and minimal resources.
  4. Something in between 1-2 and 3, where one job should take half of the cluster resources.
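Illustrative (assumed) settings for the job types above, one application per type since executor sizing cannot change inside a single JVM; the numbers are placeholders for a hypothetical cluster.

```scala
// 1. IO-bound dump: many small executors, little memory each.
val dumpConf = Map(
  "spark.executor.cores"  -> "1",
  "spark.executor.memory" -> "2g")

// 2. CPU-bound processing of the dump: one fat executor per machine.
val heavyConf = Map(
  "spark.executor.cores"  -> "16",
  "spark.executor.memory" -> "48g")

// 3. Quick retrieval: minimal footprint, capped total cores.
val quickConf = Map(
  "spark.executor.cores"  -> "1",
  "spark.executor.memory" -> "1g",
  "spark.cores.max"       -> "4")

// 4. Medium job: roughly half of the cluster, capped via spark.cores.max.
val mediumConf = Map(
  "spark.executor.cores"  -> "8",
  "spark.executor.memory" -> "24g",
  "spark.cores.max"       -> "64")
```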

Recommended answer

Spark Standalone uses a simple FIFO scheduler for applications. By default, each application uses all the available nodes in the cluster. The number of nodes can be limited per application, per user, or globally. Other resources, such as memory, CPUs, etc., can be controlled via the application's SparkConf object.
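A sketch of the per-application limits mentioned above for Standalone, with assumed values: spark.cores.max caps one application so several applications (each in its own JVM) can run side by side instead of the first one taking every core.

```scala
import org.apache.spark.sql.SparkSession

// Cap this application's share of the Standalone cluster (assumed values);
// a cluster-wide default can be set on the master via spark.deploy.defaultCores.
val cappedApp = SparkSession.builder()
  .master("spark://master-host:7077")      // assumed master URL
  .appName("quick-retrieval")              // hypothetical app name
  .config("spark.cores.max", "8")          // total cores this application may use
  .config("spark.executor.cores", "1")
  .config("spark.executor.memory", "1g")
  .getOrCreate()
```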

Apache Mesos has master and slave processes. The master makes offers of resources to the application (called a framework in Apache Mesos), which either accepts the offer or not. Thus, claiming available resources and running jobs is determined by the application itself. Apache Mesos allows fine-grained control of the resources in a system, such as CPUs, memory, disks, and ports. Apache Mesos also offers coarse-grained control of resources, where Spark allocates a fixed number of CPUs to each executor in advance, which are not released until the application exits. Note that in the same cluster, some applications can be set to use fine-grained control while others are set to use coarse-grained control.
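A sketch of selecting between the two Mesos modes described above, with an assumed master URL and values; spark.mesos.coarse toggles coarse-grained versus fine-grained resource claims.

```scala
import org.apache.spark.sql.SparkSession

// Coarse-grained mode: executors and their CPUs are claimed up front and
// held until the application exits. In older Spark releases, setting
// spark.mesos.coarse to "false" selected the fine-grained, per-task mode.
val onMesos = SparkSession.builder()
  .master("mesos://mesos-master:5050")     // assumed Mesos master URL
  .config("spark.mesos.coarse", "true")
  .config("spark.cores.max", "16")         // cap the up-front claim in coarse mode
  .getOrCreate()
```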

Apache Hadoop YARN has a ResourceManager with two parts, a Scheduler and an ApplicationsManager. The Scheduler is a pluggable component. Two implementations are provided: a CapacityScheduler, useful in a cluster shared by more than one organization, and the FairScheduler, which ensures all applications, on average, get an equal share of resources. Both schedulers assign applications to queues, and each queue gets resources that are shared equally between the queues. Within a queue, resources are shared between the applications. The ApplicationsManager is responsible for accepting job submissions and starting the application-specific ApplicationMaster. In this case, the ApplicationMaster is the Spark application. In the Spark application, resources are specified in the application's SparkConf object.
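A sketch of targeting a YARN queue while keeping per-application resources in SparkConf, as described above; the queue name "analytics" is an assumption defined by the cluster's CapacityScheduler or FairScheduler.

```scala
import org.apache.spark.sql.SparkSession

// Submit into an assumed "analytics" queue; memory/cores here are the
// per-application resources the ApplicationMaster requests from YARN.
val onYarn = SparkSession.builder()
  .master("yarn")
  .config("spark.yarn.queue", "analytics")
  .config("spark.executor.memory", "8g")
  .config("spark.executor.cores", "4")
  .getOrCreate()
```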

For your case, with Standalone alone this is not possible; there may be some workaround solutions, but I haven't come across any.
