Resource optimization/utilization in EMR for a long-running job and multiple small jobs


Question

My use case:

  • We have a long-running Spark job, referred to below as LRJ. It runs once a week.
  • We have multiple small jobs that can arrive at any time. These jobs have higher priority than the long-running job.

To solve this, we created YARN queues for resource management: queue Q1 for the long-running job and queue Q2 for the small jobs.

Config:
     Q1 : capacity = 50%, can grow up to 100%
          capacity on CORE nodes = 50%, maximum 100%
     Q2 : capacity = 50%, can grow up to 100%
          capacity on CORE nodes = 50%, maximum 100%

The problem we are facing:

While the LRJ is running, it acquires all the resources, so the small jobs have to wait. Once the cluster scales up and new resources become available, the small jobs get resources. However, scaling up takes time, so there is a significant delay before resources are allocated to these jobs.

Update 1: We tried the maximum-capacity setting as described in the YARN docs, but it is not working, as I posted in my other question here

Recommended answer

After more analysis, including discussions with some unsung heroes, we decided to apply preemption on the YARN queues for our use case.

Jobs on the Q1 queue are preempted when the following sequence of events occurs:

  1. The Q1 queue is using more than its configured capacity (example: the LRJ is using more resources than the capacity configured for the queue).
  2. Jobs are suddenly submitted to the Q2 queue (example: multiple small jobs are triggered at once).
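
The two conditions can be illustrated with a small model. This is only a sketch to build intuition, not YARN's actual preemption policy: the `QueueState` class and `reclaimable_pct` function are hypothetical names, everything is expressed as a percentage of the cluster, and the real policy also applies a dead zone, per-round limits, and a grace period before killing containers.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    name: str
    guaranteed_pct: float  # configured capacity, as a percent of the cluster
    used_pct: float        # resources currently in use
    pending_pct: float     # demand that is waiting for resources
    preemptable: bool      # False when disable_preemption is true for the queue


def reclaimable_pct(over: QueueState, starved: QueueState) -> float:
    """Simplified estimate of what could be preempted from `over` for `starved`."""
    if not over.preemptable:
        return 0.0
    # Condition 1: the queue must be running above its configured capacity.
    excess = max(0.0, over.used_pct - over.guaranteed_pct)
    # Condition 2: the other queue must have pending demand below its own guarantee.
    owed = max(0.0, min(starved.pending_pct,
                        starved.guaranteed_pct - starved.used_pct))
    return min(excess, owed)


# The LRJ has grown to fill the cluster; small jobs then ask for 30% of it.
q1 = QueueState("Q1", guaranteed_pct=40, used_pct=100, pending_pct=0, preemptable=True)
q2 = QueueState("Q2", guaranteed_pct=60, used_pct=0, pending_pct=30, preemptable=False)
print(reclaimable_pct(q1, q2))
```

With both conditions met, the model reports that the small jobs' demand can be taken back from Q1. If Q1 stays within its configured capacity, or Q2 has nothing pending, the result is zero and nothing is preempted.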

To understand preemption, read this and this

The following is the sample configuration that we use in our AWS CloudFormation script to launch the EMR cluster:

Capacity scheduler configuration:

        yarn.scheduler.capacity.resource-calculator: org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
        yarn.scheduler.capacity.root.queues: Q1,Q2
        yarn.scheduler.capacity.root.Q2.capacity: 60
        yarn.scheduler.capacity.root.Q1.capacity: 40
        yarn.scheduler.capacity.root.Q2.accessible-node-labels: "*"
        yarn.scheduler.capacity.root.Q1.accessible-node-labels: "*"
        yarn.scheduler.capacity.root.accessible-node-labels.CORE.capacity: 100
        yarn.scheduler.capacity.root.Q2.accessible-node-labels.CORE.capacity: 60
        yarn.scheduler.capacity.root.Q1.accessible-node-labels.CORE.capacity: 40
        yarn.scheduler.capacity.root.Q1.accessible-node-labels.CORE.maximum-capacity: 60
        yarn.scheduler.capacity.root.Q2.disable_preemption: true
        yarn.scheduler.capacity.root.Q1.disable_preemption: false
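
The Capacity Scheduler expects the capacities of sibling queues to add up to 100, so it can be worth checking the values before launching a cluster. The following is a minimal sketch, assuming the properties have been loaded into a plain dictionary; the `sibling_capacities` helper is a hypothetical name, not part of YARN or EMR.

```python
def sibling_capacities(props, parent="root"):
    """Return {queue: capacity} for the direct children of `parent`."""
    prefix = "yarn.scheduler.capacity." + parent + "."
    queues = [q.strip() for q in props[prefix + "queues"].split(",")]
    return {q: float(props[prefix + q + ".capacity"]) for q in queues}


# Subset of the capacity scheduler properties shown above.
capacity_scheduler = {
    "yarn.scheduler.capacity.root.queues": "Q1,Q2",
    "yarn.scheduler.capacity.root.Q2.capacity": "60",
    "yarn.scheduler.capacity.root.Q1.capacity": "40",
}

caps = sibling_capacities(capacity_scheduler)
total = sum(caps.values())
if total != 100.0:
    raise ValueError("queue capacities under root sum to %s, expected 100" % total)
print(caps)
```

The same check could be repeated for the per-label values (for example the `accessible-node-labels.CORE.capacity` entries), which follow the same rule within each label.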

yarn-site configuration:

        yarn.resourcemanager.scheduler.class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
        yarn.resourcemanager.scheduler.monitor.enable: true
        yarn.resourcemanager.scheduler.monitor.policies: org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
        yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval: 2000
        yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill: 3000
        yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round: 0.5
        yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity: 0.1
        yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor: 1
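
The answer does not show the surrounding template, so the following is only a sketch of how the two property groups might be wrapped as EMR configuration classifications. It assumes the EMR API shape (a list of objects with `Classification` and `Properties`, where every value is a string); in a CloudFormation template the properties key is named `ConfigurationProperties` instead. The `emr_configurations` helper is a hypothetical name.

```python
import json


def emr_configurations(capacity_scheduler, yarn_site):
    """Wrap two property dicts as EMR configuration classifications."""
    def as_strings(props):
        # EMR expects string values; render booleans in lowercase.
        out = {}
        for key, value in props.items():
            if isinstance(value, bool):
                out[key] = "true" if value else "false"
            else:
                out[key] = str(value)
        return out

    return [
        {"Classification": "capacity-scheduler",
         "Properties": as_strings(capacity_scheduler)},
        {"Classification": "yarn-site",
         "Properties": as_strings(yarn_site)},
    ]


# Subset of the properties from the answer, for illustration.
configs = emr_configurations(
    {
        "yarn.scheduler.capacity.root.queues": "Q1,Q2",
        "yarn.scheduler.capacity.root.Q1.capacity": 40,
        "yarn.scheduler.capacity.root.Q2.capacity": 60,
        "yarn.scheduler.capacity.root.Q2.disable_preemption": True,
    },
    {
        "yarn.resourcemanager.scheduler.monitor.enable": True,
        "yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval": 2000,
    },
)
print(json.dumps(configs, indent=2))
```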

With the above in place, you have to submit each job to the appropriate queue for your use case.
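
For Spark on YARN, one way to target a queue is the `--queue` option of `spark-submit`. The sketch below only builds the command line; the `spark_submit_command` helper and the script names `weekly_lrj.py` and `small_job.py` are hypothetical placeholders, and cluster deploy mode is an assumption rather than something the answer specifies.

```python
import shlex


def spark_submit_command(queue, app, app_args=()):
    """Build a spark-submit argument list that targets a specific YARN queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",  # assumption: cluster mode
        "--queue", queue,
        app,
        *app_args,
    ]


# The long-running job goes to Q1; the small, higher-priority jobs go to Q2.
lrj = spark_submit_command("Q1", "weekly_lrj.py")
small = spark_submit_command("Q2", "small_job.py", ["--date", "2024-01-01"])

print(shlex.join(lrj))
print(shlex.join(small))
```

The argument lists can be passed to `subprocess.run` on a node where `spark-submit` is available, or adapted into the arguments of an EMR step.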
