Running EMR steps in parallel
Question
I am running Spark jobs on an EMR cluster. The issue I am facing is that all the EMR jobs I trigger execute as steps, one after another (queued).
Is there any way to make them run in parallel? If not, is there an alternative?
Recommended answer
By default, Elastic MapReduce ships with a very "step"-oriented YARN setup: a single CapacityScheduler queue with 100% of the cluster resources assigned to it. Because of this configuration, any time you submit a job to an EMR cluster, YARN maximizes cluster usage for that single job, granting it all available resources until it finishes.
Running multiple concurrent jobs in an EMR cluster (or in any other YARN-based Hadoop cluster, in fact) requires a proper YARN setup with multiple queues, so that resources are granted correctly to each job. YARN's documentation covers all of the Capacity Scheduler features well, and the setup is simpler than it sounds.
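As a rough illustration of what a multi-queue setup could look like, the sketch below builds an EMR configuration object for the `capacity-scheduler` classification using the standard `yarn.scheduler.capacity.root.*` properties. The helper name `capacity_scheduler_config`, the queue names (`default`, `etl`, `adhoc`) and the 50/30/20 split are assumptions chosen for the example, not values from the original answer.

```python
import json


def capacity_scheduler_config(queue_capacities):
    """Build an EMR configuration list for the 'capacity-scheduler' classification.

    queue_capacities maps a queue name to its share of the cluster in percent.
    The queue names and percentages are supplied by the caller; the ones used
    below are hypothetical.
    """
    total = sum(queue_capacities.values())
    if total != 100:
        # Sibling queue capacities under root are expected to add up to 100%.
        raise ValueError(f"queue capacities must sum to 100, got {total}")

    properties = {
        # Comma-separated list of the queues defined directly under root.
        "yarn.scheduler.capacity.root.queues": ",".join(queue_capacities),
    }
    for name, percent in queue_capacities.items():
        properties[f"yarn.scheduler.capacity.root.{name}.capacity"] = str(percent)

    return [{"Classification": "capacity-scheduler", "Properties": properties}]


if __name__ == "__main__":
    # Hypothetical split: half the cluster for the default queue, the rest shared.
    config = capacity_scheduler_config({"default": 50, "etl": 30, "adhoc": 20})
    print(json.dumps(config, indent=2))
```

The printed JSON is in the shape EMR accepts as cluster configurations at creation time. A Spark job could then target one of the queues with `spark-submit --queue etl ...`, so that jobs sent to different queues no longer wait on a single job that holds the whole cluster.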
YARN's FairScheduler is also quite popular, but it takes a different approach and may be a bit more difficult to configure, depending on your needs. In the simplest scenario, where you have a single Fair queue, YARN tries to grant containers to waiting jobs as soon as running jobs free them, so that every job submitted to the cluster gets at least a fraction of the compute resources as soon as they become available.
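To make the single-queue fair-sharing idea more concrete, here is a toy model of it. It is not YARN's actual implementation: it simply hands out containers one at a time to whichever job currently holds the fewest, until the cluster is full or every job's demand is met. The function name `fair_shares` and the job names are hypothetical.

```python
def fair_shares(total_containers, demands):
    """Toy max-min fair allocation of containers among jobs in one queue.

    demands maps a job name to the number of containers it asks for.
    Returns a dict mapping each job name to the containers it is granted.
    This only models the idea of fair sharing, not YARN's scheduler code.
    """
    shares = {job: 0 for job in demands}
    remaining = total_containers
    # Jobs that still want more containers than they currently hold.
    active = {job for job, wanted in demands.items() if wanted > 0}

    while remaining > 0 and active:
        # Give the next free container to the job holding the fewest;
        # sorting first makes ties resolve by job name, deterministically.
        job = min(sorted(active), key=lambda name: shares[name])
        shares[job] += 1
        remaining -= 1
        if shares[job] >= demands[job]:
            active.discard(job)

    return shares


if __name__ == "__main__":
    # Two jobs that could each use the whole 10-container cluster.
    print(fair_shares(10, {"job_a": 10, "job_b": 10}))
    # A small job next to a large one: the small one is fully served.
    print(fair_shares(10, {"job_a": 2, "job_b": 10}))
```

In this model, two large jobs split a 10-container cluster evenly, while a job that only needs 2 containers gets both and the remainder goes to the larger job. On a real cluster the scheduler itself is selected through the `yarn.resourcemanager.scheduler.class` property in `yarn-site`.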