How to run multiple Spark jobs in parallel?

Problem description

One Spark job has one Oracle query, so I have to run multiple jobs in parallel so that all the queries fire at the same time.

How do I run multiple jobs in parallel?

Recommended answer

From Job Scheduling:

Second, within each Spark application, multiple "jobs" (Spark actions) may be running concurrently if they were submitted by different threads.

In other words, a single SparkContext instance can be used by multiple threads, which gives you the ability to submit multiple Spark jobs that may or may not run in parallel.
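
For illustration, here is a minimal sketch of that pattern, assuming a SparkSession and placeholder Oracle connection details (URL, credentials and table names are made up). Each Future runs on its own thread and triggers its own Spark job, so both JDBC queries can be in flight at the same time:

import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val spark = SparkSession.builder()
  .appName("parallel-oracle-queries")
  .master("local[*]")                                           // placeholder master
  .getOrCreate()

// Hypothetical helper: read one Oracle table over JDBC and count its rows (an action).
def runQuery(table: String): Long =
  spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db-host:1521/SERVICE")  // placeholder URL
    .option("dbtable", table)
    .option("user", "scott")                                    // placeholder credentials
    .option("password", "tiger")
    .load()
    .count()                                                    // the action that triggers a Spark job

// Each Future submits its action from a different thread, so the jobs can run concurrently.
val jobs = Seq("TABLE_A", "TABLE_B").map(t => Future(runQuery(t)))
val counts = Await.result(Future.sequence(jobs), 1.hour)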

Whether the Spark jobs run in parallel depends on the number of CPUs (Spark does not track memory usage for scheduling). If there are enough CPUs to handle the tasks from multiple Spark jobs, those jobs will run concurrently.
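
For example, that CPU budget comes from the application's own configuration; the snippet below is only a sketch with placeholder values showing the usual knobs:

import org.apache.spark.SparkConf

// Placeholder values: these settings determine how many CPU cores (and therefore
// how many concurrently running tasks) the application gets.
val conf = new SparkConf()
  .setMaster("local[8]")               // 8 cores when running locally (assumption)
  .set("spark.executor.cores", "4")    // cores per executor on a cluster
  .set("spark.cores.max", "16")        // cap on total cores (standalone / coarse-grained Mesos)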

If, however, the number of CPUs is not enough, you may consider using FAIR scheduling mode (FIFO is the default); a configuration sketch follows the quoted documentation below:

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By "job", in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
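
As a sketch of the FAIR mode mentioned above (the pool name and the allocation-file path are placeholders, not required settings):

import org.apache.spark.sql.SparkSession

// Switch the scheduler from the default FIFO to FAIR.
val spark = SparkSession.builder()
  .config("spark.scheduler.mode", "FAIR")
  // Optionally describe pools in a fair-scheduler file (path is a placeholder):
  // .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

// A thread can then assign the jobs it submits to a named pool (name is made up).
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "oracle-queries")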


Just to clear things up a bit.

  1. spark-submit submits a Spark application for execution (not individual Spark jobs). A single Spark application can have one or more Spark jobs.

  2. RDD actions may or may not be blocking. SparkContext comes with two methods to submit (or run) a Spark job, i.e. SparkContext.runJob and SparkContext.submitJob, so it does not really matter whether an action itself blocks; what matters is which SparkContext method you use to get non-blocking behaviour.

Please note that the "RDD action methods" are already written and their implementations use whatever the Spark developers bet on (mostly SparkContext.runJob, as in count):

// RDD.count
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

You'd have to write your own RDD actions (on a custom RDD) to get the required non-blocking behaviour in your Spark application.
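
As a rough sketch of what such a custom non-blocking action could look like, built on SparkContext.submitJob (the name countNonBlocking is made up for illustration); it returns a FutureAction immediately instead of blocking:

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.FutureAction
import org.apache.spark.rdd.RDD

// Hypothetical non-blocking count: submits the Spark job and returns right away.
def countNonBlocking[T](rdd: RDD[T]): FutureAction[Long] = {
  val total = new AtomicLong(0L)
  rdd.sparkContext.submitJob(
    rdd,
    (iter: Iterator[T]) => iter.size.toLong,                          // work done on each partition
    rdd.partitions.indices,                                           // run on every partition
    (_: Int, partCount: Long) => { total.addAndGet(partCount); () },  // merge per-partition results
    total.get()                                                       // final result once all partitions are done
  )
}

// Usage sketch: the caller gets a FutureAction (a scala.concurrent.Future) to wait on later.
// val pending = countNonBlocking(someRdd)
// val n = scala.concurrent.Await.result(pending, scala.concurrent.duration.Duration.Inf)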
