Apache Spark - How does the internal job scheduler in Spark define what are users and what are pools


Problem description

I am sorry about being a little general here, but I am a little confused about how job scheduling works internally in Spark. From the documentation here I get that it is some sort of implementation of the Hadoop Fair Scheduler.

I am unable to understand who exactly the users are here (are they Linux users, Hadoop users, Spark clients?). I am also unable to understand how the pools are defined here. For example, in my Hadoop cluster I have given resource allocations to two different pools (let's call them team 1 and team 2). But in a Spark cluster, won't different pools, and the users in them, instantiate their own Spark contexts? Which again brings me to the question of what parameter I should pass when setting the spark.scheduler.pool property.

I have a basic understanding of how the driver instantiates a Spark context and then splits the work into jobs and tasks. Maybe I am missing the point completely here, but I would really like to understand how Spark's internal scheduler works in the context of actions, jobs and tasks.

Answer

By default, Spark works with a FIFO scheduler, where jobs are executed in FIFO order.
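To make that concrete, and to get at the spark.scheduler.pool part of the question: within a single application you can switch Spark's internal scheduler from FIFO to FAIR and then pick a named pool per thread. Below is a minimal Scala sketch; the pool names and the allocation file path are illustrative values, not required ones.

// Enable FAIR scheduling inside one application (the default is FIFO).
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-pool-demo")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // illustrative path
val sc = new SparkContext(conf)

// The pool is a thread-local property: jobs triggered from this thread go to "team1".
sc.setLocalProperty("spark.scheduler.pool", "team1")
sc.parallelize(1 to 1000000).count()   // this job runs in pool "team1"

// Clear the property to fall back to the default pool for later jobs in this thread.
sc.setLocalProperty("spark.scheduler.pool", null)

The allocation file is what actually defines the pools, roughly like this (again with illustrative names and values):

<allocations>
  <pool name="team1">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="team2">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
    <minShare>1</minShare>
  </pool>
</allocations>

So in this in-application sense, a "pool" is whatever you declare in that file, and a "user" is effectively whichever thread sets spark.scheduler.pool before triggering an action; it is not a Linux or Hadoop user.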

But if your cluster runs on YARN, YARN has a pluggable scheduler, which means you can pick the scheduler of your choice. If you are using the YARN distributed with CDH, you will have the FAIR scheduler by default, but you can also go for the Capacity Scheduler.

If you are using the YARN distributed with HDP, you will have the CAPACITY scheduler by default, and you can move to FAIR if you need it.
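Which scheduler YARN actually runs is just a ResourceManager setting, so the distributions only differ in the default they ship with. A sketch of the relevant yarn-site.xml entry (standard Hadoop class names):

<!-- yarn-site.xml: choose the ResourceManager scheduler implementation -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <!-- FAIR scheduler (the CDH default): -->
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  <!-- For the Capacity Scheduler (the HDP default) the value would instead be
       org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler -->
</property>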

How does the scheduler work with Spark?

I'm assuming that you have your Spark cluster on YARN.

When you submit a Spark job, it first hits your resource manager, which is responsible for all scheduling and resource allocation. So it is basically the same as submitting a job in Hadoop.
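For example, you can tell the application which YARN queue it should be submitted to; the queue name below is illustrative, and this is equivalent to passing --queue team1 to spark-submit on YARN:

// Submit this application to the YARN queue "team1" (illustrative name).
val conf = new SparkConf()
  .setAppName("team1-app")
  .set("spark.yarn.queue", "team1")

From that point on, which containers the application gets, and when, is decided by YARN's scheduler for that queue, not by Spark itself.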

How do the schedulers work?

Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time (using preemption to kill tasks of jobs that have exceeded their share). Unlike the default Hadoop scheduler (FIFO), which forms a queue of jobs, this lets short jobs finish in a reasonable time while not starving long jobs. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with job priorities: the priorities are used as weights to determine the fraction of total compute time that each job should get.
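With YARN's FairScheduler, the queues (what the question calls team 1 and team 2) are declared in an allocation file referenced by yarn.scheduler.fair.allocation.file, along the following lines; the queue names, weights and resource amounts are illustrative:

<?xml version="1.0"?>
<allocations>
  <queue name="team1">
    <weight>2.0</weight>
    <minResources>10000 mb,10 vcores</minResources>
  </queue>
  <queue name="team2">
    <weight>1.0</weight>
    <minResources>10000 mb,10 vcores</minResources>
  </queue>
</allocations>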

The CapacityScheduler is designed to allow sharing a large cluster while giving each organization a minimum capacity guarantee. The central idea is that the available resources in the Hadoop Map-Reduce cluster are partitioned among multiple organizations, who collectively fund the cluster based on their computing needs. There is an added benefit that an organization can access any excess capacity not being used by others. This provides elasticity for the organizations in a cost-effective manner.
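With the CapacityScheduler, the same two organizations would instead be modelled as capacities of the root queue in capacity-scheduler.xml; a sketch with illustrative numbers:

<!-- capacity-scheduler.xml: split the cluster between two organizations -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>team1,team2</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.team1.capacity</name>
  <value>60</value>  <!-- guaranteed 60% of cluster resources -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.team2.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.team1.maximum-capacity</name>
  <value>100</value>  <!-- team1 may borrow idle capacity up to the whole cluster -->
</property>

Applications are then submitted to one of these queues (for example via spark-submit --queue team1), and the user YARN accounts for is the Hadoop user that submitted the application.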
