Determining optimal number of Spark partitions based on workers, cores and DataFrame size


Question

There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is:

  • The Spark Driver node (sparkDriverCount)
  • The number of worker nodes available to a Spark cluster (numWorkerNodes)
  • The number of Spark executors (numExecutors)
  • The DataFrame being operated on by all workers/executors, concurrently (dataFrame)
  • The number of rows in the dataFrame (numDFRows)
  • The number of partitions on the dataFrame (numPartitions)
  • And finally, the number of CPU cores available on each worker node (numCpuCoresPerWorker)

I believe that all Spark clusters have one-and-only-one Spark Driver, and then 0+ worker nodes. If I'm wrong about that, please begin by correcting me! Assuming I'm more or less correct about that, let's lock in a few variables here. Let's say we have a Spark cluster with 1 Driver and 4 Worker nodes, and each Worker Node has 4 CPU cores on it (so a total of 16 CPU cores). So the "given" here is:

sparkDriverCount = 1
numWorkerNodes = 4
numCpuCores = numWorkerNodes * numCpuCoresPerWorker = 4 * 4 = 16

Given that as the setup, I'm wondering how to determine a few things. Specifically:

  • What is the relationship between numWorkerNodes and numExecutors? Is there some known/generally-accepted ratio of workers to executors? Is there a way to determine numExecutors given numWorkerNodes (or any other inputs)?
  • Is there a known/generally-accepted/optimal ratio of numDFRows to numPartitions? How does one calculate the 'optimal' number of partitions based on the size of the dataFrame?
  • I've heard from other engineers that a general 'rule of thumb' is numPartitions = numWorkerNodes * numCpuCoresPerWorker. Is there any truth to that? In other words, it prescribes that one should have 1 partition per CPU core.

Answer

Yes, an application has one and only one Driver.

What is the relationship between numWorkerNodes and numExecutors?

A worker can host multiple executors; you can think of the worker as the machine/node of your cluster and the executor as a process (executing on a core) that runs on that worker.

So `numWorkerNodes <= numExecutors`.
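In other words, with the question's hypothetical 4-worker cluster, something like the following holds (executorsPerWorker below is just an assumed, illustrative value):

numWorkerNodes = 4
executorsPerWorker = 2                                # a worker can host several executors
numExecutors = numWorkerNodes * executorsPerWorker    # 8, so numWorkerNodes <= numExecutors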

Is there any ratio for them?

Personally, having worked both in a fake cluster, where my laptop was the Driver and a virtual machine on the very same laptop was the worker, and in an industrial cluster of >10k nodes, I didn't need to care about that, since it seems that Spark takes care of it.

I just use:

--num-executors 64

when I launch/submit my script, and Spark knows, I guess, how many workers it needs to summon (by taking into account other parameters as well, of course, and the nature of the machines).
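As a hedged sketch: --num-executors corresponds to the spark.executor.instances property, so the same request can also be made programmatically; the concrete values and the executor-cores setting below are assumptions for illustration, not recommendations.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("smeeb-App")
        .set("spark.executor.instances", "64")   # same effect as --num-executors 64
        .set("spark.executor.cores", "4"))       # cores per executor (illustrative)
sc = SparkContext(conf=conf)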

Thus, personally, I don't know any such ratio.

Is there a known/generally-accepted/optimal ratio of numDFRows to numPartitions?

I am not aware of one, but as a rule of thumb you could rely on the product of #executors and #executor.cores, and then multiply that by 3 or 4. Of course, this is a heuristic. In PySpark it would look like this:

from pyspark import SparkContext

sc = SparkContext(appName="smeeb-App")
total_cores = int(sc._conf.get('spark.executor.instances')) * int(sc._conf.get('spark.executor.cores'))
# aim for roughly 3-4 partitions per available core
dataset = sc.textFile(input_path, total_cores * 3)
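Since the question is about a DataFrame rather than an RDD, here is a hedged equivalent sketch; the SparkSession, the Parquet input, and the 3x multiplier are only assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smeeb-App").getOrCreate()
# defaultParallelism usually reflects the total executor cores of the cluster
target_partitions = spark.sparkContext.defaultParallelism * 3
df = spark.read.parquet(input_path).repartition(target_partitions)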

How does one calculate the 'optimal' number of partitions based on the size of the DataFrame?

That's a great question. Of course it's hard to answer, and it depends on your data, cluster, etc., but as I have discussed myself:

Too few partitions and you will have enormous chunks of data, especially when you are dealing with big data, thus putting your application under memory stress.

Too many partitions and your HDFS will be under a lot of pressure, since all the metadata that has to be generated by HDFS increases significantly as the number of partitions increases (since it maintains temporary files, etc.). *

So what you want is to find a sweet spot for the number of partitions, which is one of the parts of fine-tuning your application. :)
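A minimal sketch of that tuning loop (df and target are assumed to already exist; the coalesce value is only illustrative):

current = df.rdd.getNumPartitions()   # how many partitions the DataFrame has right now
df = df.repartition(target)           # increase parallelism (full shuffle)
df = df.coalesce(target // 2)         # or reduce partitions without a full shuffle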

A general 'rule of thumb' is: numPartitions = numWorkerNodes * numCpuCoresPerWorker, any truth to that?

Ah, I was writing the heuristic above before seeing this. So this is already answered, but take into account the difference between a worker and an executor.
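As a quick worked example with the question's hypothetical 4 x 4-core cluster (the 3x multiplier is just the heuristic from above, not a hard rule):

numWorkerNodes = 4
numCpuCoresPerWorker = 4
rule_of_thumb = numWorkerNodes * numCpuCoresPerWorker   # 16 partitions, 1 per core
heuristic = rule_of_thumb * 3                           # ~48 partitions, 3-4 per core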

* I got burned by this just today: using too many partitions caused …
