Spark coalesce relationship with number of executors and cores
Problem description
I'm bringing up a very silly question about Spark as I want to clear my confusion. I'm very new to Spark and still trying to understand how it works internally.
Say, if I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my partition count to 100.
Now I run this job with 12 executors, 5 cores per executor, which means 60 tasks when it runs. Does that mean each of the tasks will work on one single partition independently?
Round 1: 12 executors each with 5 cores => 60 tasks process 60 partitions
Round 2: 8 executors each with 5 cores => 40 tasks process the remaining 40 partitions, and 4 executors never receive work the second time
Or will all tasks from the same executor work on the same partition?
Round 1: 12 executors => process 12 partitions
Round 2: 12 executors => process 12 partitions
Round 3: 12 executors => process 12 partitions
...
Round 9 (96 partitions already processed): 4 executors => process the remaining 4 partitions
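The arithmetic behind the question's first interpretation (every core runs its own task on its own partition) can be sketched in plain Python; this is not Spark API code, just the numbers from the question:

```python
import math

# Numbers from the question: coalesce(100) output, 12 executors, 5 cores each.
partitions = 100
executors = 12
cores_per_executor = 5

# Under the first interpretation, 1 core = 1 task at a time,
# so 60 task slots are available per wave.
task_slots = executors * cores_per_executor

# 100 partitions over 60 slots => 2 waves: 60 partitions, then the last 40.
waves = math.ceil(partitions / task_slots)
remaining_in_last_wave = partitions - task_slots
```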
Solution

Say, if I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my partition count to 100.
In Spark, by default, number of partitions = number of HDFS blocks. Since coalesce(100) is specified, Spark will divide the input data into 100 partitions.

Now I run this job with 12 executors, 5 cores per executor, which means 60 tasks when it runs. Does that mean each of the tasks will work on one single partition independently?
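As a toy illustration of that partition count (this is not Spark's actual implementation, which merges existing partitions without a shuffle using its own locality-aware grouping), coalescing 1000 input splits down to 100 partitions can be pictured as grouping consecutive splits:

```python
def toy_coalesce(splits, target_partitions):
    """Group consecutive splits so at most target_partitions groups remain."""
    per_group = -(-len(splits) // target_partitions)  # ceiling division
    return [splits[i:i + per_group]
            for i in range(0, len(splits), per_group)]

# 1000 input splits coalesced into 100 partitions of 10 splits each.
partitions = toy_coalesce(list(range(1000)), 100)
```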
As you might have passed:
--num-executors 12: Number of executors to launch for the application.
--executor-cores 5: Number of cores per executor. 1 core = 1 task at a time.

So the execution of partitions will go like this:
Round 1
12 partitions will be processed by 12 executors, with 5 tasks (threads) each.
Round 2
12 partitions will be processed by 12 executors, with 5 tasks (threads) each.
...
Round 9 (96 partitions already processed):
4 partitions will be processed by 4 executors, with 5 tasks (threads) each.
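The round-by-round schedule described above (12 partitions per round, with the leftover 4 in a final round) can be simulated like this; it is a sketch of the answer's model, not of Spark's actual scheduler:

```python
def round_schedule(num_partitions, num_executors):
    """Each round, every available executor takes one partition."""
    rounds = []
    remaining = num_partitions
    while remaining > 0:
        batch = min(num_executors, remaining)
        rounds.append(batch)
        remaining -= batch
    return rounds

# 8 full rounds of 12 partitions, then a final round with the remaining 4.
schedule = round_schedule(100, 12)
```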
NOTE: Usually, some executors complete their assigned work quickly (this depends on various parameters like data locality, network I/O, CPU, etc.). Such an executor then picks the next partition to process, after waiting the configured amount of scheduling time.
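For reference, the two flags discussed above, plus the scheduling wait this note refers to (governed by spark.locality.wait, whose default is 3s), would appear on a spark-submit invocation roughly like this; the application script name is hypothetical:

```shell
spark-submit \
  --num-executors 12 \
  --executor-cores 5 \
  --conf spark.locality.wait=3s \
  my_app.py
```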
This concludes the article on the relationship between Spark coalesce and the number of executors and cores. We hope the recommended answer is helpful, and thank you for supporting IT屋!