How does Spark parallelize slices to tasks/executors/workers?

Problem Description

I have a 2-node Spark cluster with 4 cores per node.

        MASTER
(Worker-on-master)              (Worker-on-node1)

Spark configuration:

  • slaves: master, node1
  • SPARK_WORKER_INSTANCES = 1

I am trying to understand Spark's parallelize behaviour. The SparkPi example has this code:

import scala.math.random   // needed for `random`

val slices = 8  // my test value for slices
val n = 100000 * slices
// `spark` is the SparkContext in this example; sample n points in the unit square
val count = spark.parallelize(1 to n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x*x + y*y < 1) 1 else 0   // 1 if the point falls inside the unit circle
}.reduce(_ + _)

According to the docs:

Spark will run one task for each slice of the cluster. Typically you want 2-4 slices for each CPU in your cluster.

I set slices to 8, which means the working set will be divided among 8 tasks on the cluster; in turn, each worker node gets 4 tasks (1:1 per core).
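
As a quick check, the number of partitions (and therefore tasks per action) can be read off the RDD directly; a minimal sketch, assuming `sc` is the SparkContext:

val rdd = sc.parallelize(1 to 800000, 8)
println(rdd.partitions.length)   // prints 8: one task per partition for each action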

Questions:

  1. Where can I see task-level details? Inside the executors I don't see the task breakdown, so I can't see the effect of slices in the UI.

  2. How can I programmatically find the working set size for the map function above? I assume it is n/slices (100000 above).

  3. Are the multiple tasks run by an executor run sequentially, or in parallel in multiple threads?

  4. Reasoning behind 2-4 slices per CPU.

I assume ideally we should tune SPARK_WORKER_INSTANCES to correspond to the number of cores in each node (in a homogeneous cluster) so that each core gets its own executor and task (1:1:1).

Recommended Answer

I will try to answer your questions as best I can:

1.- Where can I see task-level details?

When submitting a job, Spark stores information about the task breakdown on each worker node, apart from the master. This data is stored, I believe (I have only tested Spark on EC2), in the work folder under the Spark directory.

2.- How to programmatically find the working set size for the map function?

Although I am not sure whether it stores the in-memory size of the slices, the logs mentioned in the first answer provide information about the number of lines each RDD partition contains.
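
If you want the per-partition element counts without digging through the logs, something along these lines should work; a sketch, assuming `sc` is the SparkContext and the 8-slice example above:

val data = sc.parallelize(1 to 800000, 8)
// count the elements each partition actually holds
val sizes = data
  .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
  .collect()
sizes.foreach { case (idx, size) => println(s"partition $idx -> $size elements") }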

3.- Are the multiple tasks run by an executor run sequentially or in parallel in multiple threads?

I believe different tasks inside a node run sequentially. This is shown in the logs indicated above, which record the start and end time of every task.
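
One way to check this empirically is to log the thread name from inside each task: if the executor runs tasks concurrently, several distinct thread names show up for overlapping tasks. A minimal sketch, assuming `sc` is the SparkContext (the output appears in each executor's stdout, not on the driver):

sc.parallelize(1 to 8, 8).foreach { i =>
  // each task reports which executor thread it runs on
  println(s"element $i handled by thread ${Thread.currentThread.getName}")
}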

4.- Reasoning behind 2-4 slices per CPU

Some nodes finish their tasks faster than others. Having more slices than available cores distributes the tasks in a balanced way, avoiding long processing times due to slower nodes.
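
Following that reasoning, one common pattern is to derive the slice count from the cluster's parallelism instead of hard-coding it; a sketch, assuming `sc` is the SparkContext and applying the 2-4x guideline from the docs:

// defaultParallelism is typically the total number of cores offered to the application
val slices = sc.defaultParallelism * 3   // somewhere in the 2-4x-per-core range
val n = 100000 * slices
val rdd = sc.parallelize(1 to n, slices)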
