Running several Spark jobs concurrently from the driver

Problem Description

Imagine that we have 3 customers and we want to do the same work for each of them in parallel.

// One independent Spark job per customer: read its file, aggregate, write the result.
def doSparkJob(customerId: String): Unit = {
  spark
    .read.json(s"$customerId/file.json")
    .map(...)
    .reduceByKey(...)
    .write
    .partitionBy("id")
    .parquet("output/")
}

We do it concurrently like this (from the Spark driver):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Start the three jobs eagerly; the for-comprehension then just waits for them.
val f1 = Future { doSparkJob("customer1") }
val f2 = Future { doSparkJob("customer2") }
val f3 = Future { doSparkJob("customer3") }

val jobs: Future[(Unit, Unit, Unit)] =
  for (r1 <- f1; r2 <- f2; r3 <- f3) yield (r1, r2, r3)

Await.ready(jobs, 5.hours)

Do I understand correctly that this is a bad approach? The concurrent jobs will push each other's data out of executor memory and a lot of spilling to disk will appear. How will Spark manage executing tasks from parallel jobs? How does the shuffle behave when we have 3 concurrent jobs from one driver and only 3 executors with one core each?

I guess a good approach should look like this: read the data for all customers together, groupByKey by customer, and do what we want to do.
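For illustration, a minimal sketch of that alternative, assuming the same spark session as above; the customer column, the id grouping key and the count aggregation are placeholders rather than anything from the original job:

import org.apache.spark.sql.functions._

// Read every customer's file in one application and tag each row with its
// customer id, instead of launching one Spark job per customer.
val customers = Seq("customer1", "customer2", "customer3")
val all = customers
  .map(id => spark.read.json(s"$id/file.json").withColumn("customer", lit(id)))
  .reduce(_ unionByName _)

// Group by customer (plus the business key) and aggregate once for everyone.
all
  .groupBy("customer", "id")
  .agg(count("*").as("cnt"))
  .write
  .partitionBy("customer")
  .parquet("output/")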

Recommended Answer

Do I understand correctly that this is a bad approach?

Not necessarily. A lot depends on the context, and Spark implements its own set of AsyncRDDActions to address scenarios like this one (though there is no Dataset equivalent).
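For example, the asynchronous actions return a FutureAction (which extends scala.concurrent.Future), so several jobs can be submitted from the driver without blocking; a minimal sketch, with purely illustrative paths and counts:

import scala.concurrent.Await
import scala.concurrent.duration._

// countAsync, collectAsync, foreachAsync, ... come from AsyncRDDActions and
// submit the job without blocking the calling thread.
val c1 = spark.sparkContext.textFile("customer1/file.json").countAsync()
val c2 = spark.sparkContext.textFile("customer2/file.json").countAsync()

// Both jobs are already running; wait for their results as with any Future.
val (count1, count2) = (Await.result(c1, 1.hour), Await.result(c2, 1.hour))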

In the simplest scenario, with static allocation, it is quite likely that Spark will simply schedule all jobs sequentially due to a lack of resources. Unless configured otherwise, this is the most probable outcome with the described configuration. Please keep in mind that Spark can use in-application scheduling with the FAIR scheduler to share limited resources between multiple concurrent jobs. See Scheduling Within an Application.
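A minimal sketch of what that setup can look like, assuming the session is built in the driver; the application and pool names are arbitrary, and per-pool weights would normally go into an allocation file referenced by spark.scheduler.allocation.file:

import org.apache.spark.sql.SparkSession

// Enable the FAIR scheduler for the whole application.
val spark = SparkSession.builder()
  .appName("per-customer-jobs") // hypothetical name
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// Each thread that submits jobs can pin them to a named pool, so concurrent
// per-customer jobs share executors instead of starving one another.
def doSparkJobInPool(customerId: String): Unit = {
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", customerId)
  try doSparkJob(customerId)
  finally spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)
}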

If the amount of resources is sufficient to start multiple jobs at the same time, there can be competition between the individual jobs, especially for IO- and memory-intensive ones. If all jobs use the same resources (especially databases), it is possible that Spark will cause throttling and subsequent failures or timeouts. A less severe effect of running multiple jobs can be increased cache eviction.

Overall, there are multiple factors to consider when you choose between sequential and concurrent execution, including, but not limited to, available resources (the Spark cluster and external services), the choice of API (RDDs tend to be greedier than SQL and therefore require some low-level management) and the choice of operators. Even if jobs run sequentially, you may still decide to submit them asynchronously to improve driver utilization and reduce latency. This is particularly useful with Spark SQL and complex execution plans (a common bottleneck in Spark SQL): that way Spark can crunch new execution plans while other jobs are executing.
