Spark coalesce relationship with number of executors and cores


Problem Description



I'm bringing up a very silly question about Spark as I want to clear up my confusion. I'm very new to Spark and still trying to understand how it works internally.

Say, if I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my number of partitions to 100.

Now I run this job with 12 executors with 5 cores per executor, which means 60 tasks when it runs. Does that mean each of the tasks will work on one single partition independently?

Round 1: 12 executors each with 5 cores => 60 tasks process 60 partitions
Round 2: 8 executors each with 5 cores => 40 tasks process the rest of the 40 partitions, and 4 executors never get any work the second time

Or will all tasks from the same executor work on the same partition?

Round 1: 12 executors => process 12 partitions
Round 2: 12 executors => process 12 partitions
Round 3: 12 executors => process 12 partitions
....
....
....
Round 9 (96 partitions already processed): 4 executors => process the remaining 4 partitions

Solution

Say, if I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my number of partitions to 100.

In Spark, by default, the number of partitions = the number of HDFS blocks; since coalesce(100) is specified, Spark will divide the input data into 100 partitions.
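
A minimal sketch of that flow, assuming a simple SparkSession-based batch job; the input/output paths and the app name are hypothetical:

    import org.apache.spark.sql.SparkSession

    object CoalesceExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("coalesce-example").getOrCreate()

        // Read the ~1000 input files; the initial number of partitions roughly
        // follows the number of input splits / HDFS blocks.
        val input = spark.read.textFile("hdfs:///data/input/*")   // hypothetical path
        println(s"Partitions before coalesce: ${input.rdd.getNumPartitions}")

        // Reduce to 100 partitions; coalesce merges existing partitions
        // rather than performing a full shuffle.
        val coalesced = input.coalesce(100)
        println(s"Partitions after coalesce: ${coalesced.rdd.getNumPartitions}")

        coalesced.write.text("hdfs:///data/output")               // hypothetical path
        spark.stop()
      }
    }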

Now I run this job with 12 executors with 5 cores per executor, which means 60 tasks when it runs. Does that mean each of the tasks will work on one single partition independently?

You might have passed options like the following when submitting the job:

--num-executors 12 : Number of executors to launch in an application.

--executor-cores 5 : Number of cores per executor. 1 core = 1 task at a time

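Putting those options together, a YARN submission might look roughly like this (the class name, jar, and memory setting are placeholders, not from the original post):

    # 12 executors x 5 cores each = up to 60 tasks running in parallel
    spark-submit \
      --master yarn \
      --num-executors 12 \
      --executor-cores 5 \
      --executor-memory 4g \
      --class com.example.MyApp \
      my-app.jar
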
So the execution of partitions will go like this:

Round 1

12 partitions will be processed by 12 executors with 5 tasks (threads) each.

Round 2

12 partitions will be processed by 12 executors with 5 tasks (threads) each.
.
.
.

Round 9 (96 partitions already processed):

4 partitions will be processed by 4 executors with 5 tasks (threads) each.
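
The counting behind that walkthrough, as a small illustrative sketch only; it simply reproduces the arithmetic of one partition per executor per round described above:

    // 100 coalesced partitions spread over 12 executors, one partition per
    // executor per round, as in the walkthrough above.
    val numPartitions = 100
    val numExecutors  = 12
    val fullRounds    = numPartitions / numExecutors              // 8 rounds of 12 partitions
    val leftover      = numPartitions % numExecutors              // 4 partitions remain
    val totalRounds   = fullRounds + (if (leftover > 0) 1 else 0) // 9 rounds in total
    println(s"$totalRounds rounds: $fullRounds x $numExecutors partitions, then $leftover in the last round")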

NOTE: Usually, some executors complete their assigned work more quickly than others (depending on parameters like data locality, network I/O, CPU, etc.). Such an executor will then pick the next partition to process, after waiting the configured amount of scheduling time.
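
The 'configured amount of scheduling time' most likely refers to Spark's locality wait (spark.locality.wait). A minimal sketch of how it could be tuned when building the session; the 3s value shown is just the documented default:

    import org.apache.spark.sql.SparkSession

    // spark.locality.wait: how long the scheduler waits for a data-local
    // slot to become free before launching the task at a less-local level.
    val spark = SparkSession.builder()
      .appName("locality-wait-example")        // hypothetical app name
      .config("spark.locality.wait", "3s")     // 3s is the documented default
      .getOrCreate()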
