How to submit multiple Spark applications in parallel without spawning separate JVMs?


Problem description

The problem is that you need to launch a separate JVM to create a separate session with a different amount of RAM per job.

How can I submit several Spark applications simultaneously without manually spawning separate JVMs?

My app runs on a single server, within a single JVM. That clashes with Spark's session-per-JVM paradigm, which says:

1 JVM => 1 app => 1 session => 1 context => 1 RAM/executors/cores config

I'd like to have different configurations per Spark application without launching extra JVMs manually (see the sketch after the list below). The configurations in question:

  1. spark.executor.cores
  2. spark.executor.memory
  3. spark.dynamicAllocation.maxExecutors
  4. spark.default.parallelism
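
To make the limitation concrete, here is a minimal Scala sketch, assuming a hypothetical long-running app plus a small on-demand job (the app name and the config values are illustrative): within one JVM, the second getOrCreate() returns a session backed by the SparkContext created by the first, so executor-level settings cannot differ.

```scala
import org.apache.spark.sql.SparkSession

// First session in this JVM: these settings take effect because they are
// used to create the underlying SparkContext.
val heavy = SparkSession.builder()
  .appName("long-running-job")                          // hypothetical name
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "28g")
  .config("spark.dynamicAllocation.maxExecutors", "10") // illustrative value
  .config("spark.default.parallelism", "200")           // illustrative value
  .getOrCreate()

// Second builder in the same JVM: getOrCreate() reuses the existing
// SparkContext, so executor-level settings such as spark.executor.memory
// are ignored here -- this is the 1 JVM => 1 context limitation.
val tiny = SparkSession.builder()
  .config("spark.executor.memory", "1g")                // no effect
  .getOrCreate()
```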

Use case

You have started a long-running job that takes, say, 4-5 hours to complete. The job runs in a session configured with spark.executor.memory=28GB and spark.executor.cores=2. Now you want to launch a 5-10 second job on user demand, without waiting 4-5 hours. This tiny job needs 1GB of RAM. What would you do? Submit the tiny job on behalf of the long-running job's session? Then it would claim 28GB.

  1. Spark allows you to configure the number of CPUs and executors only at the session level. The Spark scheduling pool lets you slice and dice only the number of cores, not RAM or executors, right?
  2. Spark Job Server. But it doesn't support Spark newer than 2.0, so it's not an option for me. It does, however, solve the problem for versions older than 2.0. Among the Spark JobServer features they state Separate JVM per SparkContext for isolation (EXPERIMENTAL), which means spawning a new JVM per context.
  3. Mesos fine-grained mode is deprecated.
  4. This hack, but it's too risky to use in production.
  5. The hidden Apache Spark REST API for job submission, read this and this. There is definitely a way to specify executor memory and cores there, but what is the behavior when submitting two jobs with different configs? As I understand it, this is the Java REST client for it (see the sketch after this list).
  6. Livy. Not familiar with it, but it looks like they only have a Java API for batch submission, which is not an option for me.
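
Regarding option 5: as far as I understand the hidden standalone-master REST submission endpoint (port 6066, /v1/submissions/create), a request could look roughly like the Scala sketch below. The host, jar path, class name and Spark version are assumptions, and each accepted submission still gets its own driver JVM in the cluster, so the JVM-per-application model is hidden behind HTTP rather than removed.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Hypothetical master host and jar/class names; 6066 is the standalone
// master's REST submission port.
val submitUrl = new URL("http://spark-master:6066/v1/submissions/create")

val payload =
  """{
    |  "action": "CreateSubmissionRequest",
    |  "appResource": "hdfs:///jobs/tiny-job.jar",
    |  "mainClass": "com.example.TinyJob",
    |  "appArgs": [],
    |  "clientSparkVersion": "2.3.0",
    |  "environmentVariables": { "SPARK_ENV_LOADED": "1" },
    |  "sparkProperties": {
    |    "spark.app.name": "tiny-job",
    |    "spark.master": "spark://spark-master:7077",
    |    "spark.submit.deployMode": "cluster",
    |    "spark.executor.memory": "1g",
    |    "spark.executor.cores": "1"
    |  }
    |}""".stripMargin

val conn = submitUrl.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
val out = conn.getOutputStream
out.write(payload.getBytes(StandardCharsets.UTF_8))
out.close()

// The response carries a submissionId that can later be polled at
// /v1/submissions/status/<submissionId>.
println(scala.io.Source.fromInputStream(conn.getInputStream).mkString)
```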

Accepted answer

With the use case, this is much clearer now. There are two possible solutions:

If you require shared data between those jobs, use the FAIR scheduler and a (REST) frontend (as SparkJobServer, Livy, etc. do). You don't need to use SparkJobServer either; it should be relatively easy to code if you have a fixed scope. I've seen projects go in that direction. All you need is an event loop and a way to translate your incoming queries into Spark queries. In a way, I would expect there to be demand for a library to cover this use case, since it's pretty much always the first thing you have to build when you work on a Spark-based application/framework. In this case, you can size your executors according to your hardware, and Spark will manage the scheduling of your jobs. With YARN's dynamic resource allocation, YARN will also free resources (kill executors) should your framework/app be idle. For more information, read here: http://spark.apache.org/docs/latest/job-scheduling.html
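
A rough Scala sketch of this first option, assuming a single shared session plus an in-process event loop. The app name, pool names and allocation-file path are made up; the pools defined in fairscheduler.xml control weights and minimum shares of cores/tasks, not executor memory.

```scala
import org.apache.spark.sql.SparkSession

// One shared session sized for the whole server; jobs are isolated into
// FAIR scheduler pools instead of separate JVMs.
val spark = SparkSession.builder()
  .appName("shared-frontend")                                    // hypothetical
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")
  .getOrCreate()

// Inside the event loop, each incoming request is handled on its own thread
// and assigned to a pool; pool weights/minShares live in fairscheduler.xml.
def handleRequest(pool: String)(work: => Unit): Thread = {
  val t = new Thread(() => {
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
    work
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", null) // reset
  })
  t.start()
  t
}

// Hypothetical usage: the long job and the tiny job share the executors but
// are scheduled fairly instead of FIFO.
handleRequest("batch")       { spark.range(1000000000L).count() }
handleRequest("interactive") { spark.range(1000L).count() }
```

All requests share one set of executors, so this divides cores/tasks fairly but cannot give the tiny job a different spark.executor.memory, which is exactly the constraint from point 1 of the question.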

If you don't need shared data, use YARN (or another resource manager) to assign resources fairly to both jobs. YARN has a fair scheduling mode, and you can set the resource demands per application. If you think this suits you but you do need shared data, then you might want to think about using Hive or Alluxio to provide a data interface. In this case you would run two spark-submits and maintain multiple drivers in the cluster. Building additional automation around spark-submit can help make this less annoying and more transparent to end users. This approach is also high-latency, since resource allocation and SparkSession initialization take a more or less constant amount of time.
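
One possible shape for that automation is a sketch using Spark's SparkLauncher, the programmatic wrapper around spark-submit. The jar paths, class names and the YARN master are assumptions, and each call still produces its own driver JVM; it is just launched by the service instead of by hand.

```scala
import org.apache.spark.launcher.SparkLauncher

// Programmatic equivalent of two spark-submits with different resource
// configs. Requires the spark-launcher dependency and SPARK_HOME pointing
// at a Spark distribution. Paths and classes below are hypothetical.
def submit(jar: String, mainClass: String, execMem: String, execCores: String) =
  new SparkLauncher()
    .setMaster("yarn")
    .setDeployMode("cluster")
    .setAppResource(jar)
    .setMainClass(mainClass)
    .setConf("spark.executor.memory", execMem)
    .setConf("spark.executor.cores", execCores)
    .startApplication()   // returns a SparkAppHandle for monitoring

val longJob = submit("hdfs:///jobs/long-job.jar", "com.example.LongJob", "28g", "2")
val tinyJob = submit("hdfs:///jobs/tiny-job.jar", "com.example.TinyJob", "1g", "1")
```

startApplication() returns a SparkAppHandle, so the submitting service can track each job's state (and kill it if needed) without blocking on the child process.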
