Spark coalesce relationship with number of executors and cores


Problem Description



I'm bringing up a very silly question about Spark as I want to clear up my confusion. I'm very new to Spark and still trying to understand how it works internally.

Say, if I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my number of partitions to 100.

Now I run this job with 12 executors with 5 cores per executor, which means 60 tasks when it runs. Does that mean each of the tasks will work on one single partition independently?

Round 1: 12 executors each with 5 cores => 60 tasks process 60 partitions
Round 2: 8 executors each with 5 cores => 40 tasks process the rest of the 40 partitions, and 4 executors never get any work the second time

Or will all tasks from the same executor work on the same partition?

Round 1: 12 executors => process 12 partitions
Round 2: 12 executors => process 12 partitions
Round 3: 12 executors => process 12 partitions
....
....
....
Round 9 (96 partitions already processed): 4 executors => process the remaining 4 partitions

Solution

Say, if I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my number of partitions to 100.

In Spark, by default, the number of partitions = the number of HDFS blocks; since coalesce(100) is specified, Spark will divide the input data into 100 partitions.
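
A minimal sketch of that flow, assuming a simple SparkSession-based batch job; the input/output paths and the app name are hypothetical:

    import org.apache.spark.sql.SparkSession

    object CoalesceExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("coalesce-example").getOrCreate()

        // Read the ~1000 input files; the initial number of partitions roughly
        // follows the number of input splits / HDFS blocks.
        val input = spark.read.textFile("hdfs:///data/input/*")   // hypothetical path
        println(s"Partitions before coalesce: ${input.rdd.getNumPartitions}")

        // Reduce to 100 partitions; coalesce merges existing partitions
        // rather than performing a full shuffle.
        val coalesced = input.coalesce(100)
        println(s"Partitions after coalesce: ${coalesced.rdd.getNumPartitions}")

        coalesced.write.text("hdfs:///data/output")               // hypothetical path
        spark.stop()
      }
    }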

Now I run this job with 12 executors with 5 cores per executor, which means 60 tasks when it runs. Does that mean each of the tasks will work on one single partition independently?

You might have passed options like the following when submitting the job:

--num-executors 12 : Number of executors to launch in an application.

--executor-cores 5 : Number of cores per executor. 1 core = 1 task at a time

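Putting those options together, a YARN submission might look roughly like this (the class name, jar, and memory setting are placeholders, not from the original post):

    # 12 executors x 5 cores each = up to 60 tasks running in parallel
    spark-submit \
      --master yarn \
      --num-executors 12 \
      --executor-cores 5 \
      --executor-memory 4g \
      --class com.example.MyApp \
      my-app.jar
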
So the execution of partitions will go like this:

Round 1

12 partitions will be processed by 12 executors with 5 tasks (threads) each.

Round 2

12 partitions will be processed by 12 executors with 5 tasks (threads) each.
.
.
.

Round 9 (96 partitions already processed):

4 partitions will be processed by 4 executors with 5 tasks (threads) each.
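
The counting behind that walkthrough, as a small illustrative sketch only; it simply reproduces the arithmetic of one partition per executor per round described above:

    // 100 coalesced partitions spread over 12 executors, one partition per
    // executor per round, as in the walkthrough above.
    val numPartitions = 100
    val numExecutors  = 12
    val fullRounds    = numPartitions / numExecutors              // 8 rounds of 12 partitions
    val leftover      = numPartitions % numExecutors              // 4 partitions remain
    val totalRounds   = fullRounds + (if (leftover > 0) 1 else 0) // 9 rounds in total
    println(s"$totalRounds rounds: $fullRounds x $numExecutors partitions, then $leftover in the last round")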

NOTE: Usually, some executors complete their assigned work more quickly than others (depending on parameters like data locality, network I/O, CPU, etc.). Such an executor will then pick the next partition to process, after waiting the configured amount of scheduling time.
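
The 'configured amount of scheduling time' most likely refers to Spark's locality wait (spark.locality.wait). A minimal sketch of how it could be tuned when building the session; the 3s value shown is just the documented default:

    import org.apache.spark.sql.SparkSession

    // spark.locality.wait: how long the scheduler waits for a data-local
    // slot to become free before launching the task at a less-local level.
    val spark = SparkSession.builder()
      .appName("locality-wait-example")        // hypothetical app name
      .config("spark.locality.wait", "3s")     // 3s is the documented default
      .getOrCreate()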
