Determining optimal number of Spark partitions based on workers, cores and DataFrame size

Problem description

There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is:

  • The Spark Driver node (sparkDriverCount)
  • The number of worker nodes available to a Spark cluster (numWorkerNodes)
  • The number of Spark executors (numExecutors)
  • The DataFrame being operated on by all workers/executors, concurrently (dataFrame)
  • The number of rows in the dataFrame (numDFRows)
  • The number of partitions on the dataFrame (numPartitions)
  • And finally, the number of CPU cores available on each worker node (numCpuCoresPerWorker)

I believe that all Spark clusters have one-and-only-one Spark Driver, and then 0+ worker nodes. If I'm wrong about that, please begin by correcting me! Assuming I'm more or less correct about that, let's lock in a few variables here. Let's say we have a Spark cluster with 1 Driver and 4 Worker nodes, and each Worker Node has 4 CPU cores on it (so a total of 16 CPU cores). So the "given" here is:

sparkDriverCount = 1
numWorkerNodes = 4
numCpuCores = numWorkerNodes * numCpuCoresPerWorker = 4 * 4 = 16

Given that as the setup, I'm wondering how to determine a few things. Specifically:

  • What is the relationship between numWorkerNodes and numExecutors? Is there some known/generally-accepted ratio of workers to executors? Is there a way to determine numExecutors given numWorkerNodes (or any other inputs)?
  • Is there a known/generally-accepted/optimal ratio of numDFRows to numPartitions? How does one calculate the 'optimal' number of partitions based on the size of the dataFrame?
  • I've heard from other engineers that a general 'rule of thumb' is: numPartitions = numWorkerNodes * numCpuCoresPerWorker, any truth to that? In other words, it prescribes that one should have 1 partition per CPU core.
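
For reference, most of these quantities can be inspected from a running application. A minimal PySpark sketch, assuming an existing SparkContext named sc and the DataFrame bound to a variable dataFrame (the executor settings only show up if they were set explicitly at submit time):

num_executors = sc.getConf().get("spark.executor.instances", "not set")
cores_per_executor = sc.getConf().get("spark.executor.cores", "not set")
default_parallelism = sc.defaultParallelism           # Spark's default split count
numDFRows = dataFrame.count()                         # row count (triggers a job)
numPartitions = dataFrame.rdd.getNumPartitions()      # current partition count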

Solution

Yes, a Spark application has one and only one Driver.

What is the relationship between numWorkerNodes and numExecutors?

A worker can host multiple executors; you can think of the worker as the machine/node of your cluster and an executor as a process (executing on a core) that runs on that worker.

So numWorkerNodes <= numExecutors.

Is there any ratio for them?

Personally, having worked both in a fake cluster (where my laptop was the Driver and a virtual machine on the very same laptop was the worker) and in an industrial cluster of >10k nodes, I never needed to care about that, since it seems that Spark takes care of it.

I just use:

--num-executors 64

when I launch/submit my script, and Spark then knows, I guess, how many workers it needs to summon (taking into account other parameters as well, of course, and the nature of the machines).

Thus, personally, I don't know any such ratio.
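
To make the worker/executor distinction concrete on the 4-worker, 4-cores-per-worker cluster from the question, here is one purely illustrative way to ask for 8 executors with 2 cores each (nominally 2 executors per worker) from PySpark; the numbers are assumptions, not a recommendation, and the resource manager may still place executors differently:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("smeeb-App")
        .set("spark.executor.instances", "8")    # 8 executors in total
        .set("spark.executor.cores", "2"))       # 2 cores each -> 16 cores, matching the cluster
sc = SparkContext(conf=conf)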


Is there a known/generally-accepted/optimal ratio of numDFRows to numPartitions?

I am not aware of one, but as a rule of thumb you could rely on the product of #executors and #executor.cores, and then multiply that by 3 or 4. Of course this is a heuristic. In PySpark it would look like this:

from pyspark import SparkContext

sc = SparkContext(appName="smeeb-App")
# total cores = executors * cores per executor; ask for ~3 partitions per core
total_cores = int(sc._conf.get('spark.executor.instances')) * int(sc._conf.get('spark.executor.cores'))
dataset = sc.textFile(input_path, total_cores * 3)
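
The same heuristic carries over to DataFrames; a rough sketch, assuming a SparkSession named spark and a DataFrame df (note that repartition performs a full shuffle, so it is not free):

total_cores = int(spark.conf.get("spark.executor.instances", "1")) * int(spark.conf.get("spark.executor.cores", "1"))
# roughly 3-4 partitions per core, as above
df = df.repartition(total_cores * 3)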

How does one calculate the 'optimal' number of partitions based on the size of the DataFrame?

That's a great question. Of course it's hard to answer and it depends on your data, cluster, etc., but as I discussed here with myself, the trade-off looks like this:

Too few partitions and you will have enormous chunks of data, especially when you are dealing with big data, thus putting your application under memory stress.

Too many partitions and you will have HDFS taking much pressure, since all the metadata that HDFS has to generate increases significantly as the number of partitions increases (it maintains temp files, etc.). *

So what you want is to find a sweet spot for the number of partitions, which is one part of fine-tuning your application. :)
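
One way to feel out that sweet spot is to look at how many records each partition actually ends up holding; a quick sketch, assuming a DataFrame df (glom() materializes each partition as a list in executor memory before counting, so be careful with very large partitions):

sizes = df.rdd.glom().map(len).collect()   # number of records in each partition
print(min(sizes), max(sizes), sum(sizes) / len(sizes))
# huge or very uneven counts suggest too few partitions; thousands of tiny ones suggest too many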

A general 'rule of thumb' is: numPartitions = numWorkerNodes * numCpuCoresPerWorker, any truth to that?

Ah, I was writing the heuristic above before seeing this. So this is already answered, but take into account the difference between a worker and an executor.
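
Putting numbers on that for the hypothetical 1-Driver, 4-worker, 4-cores-per-worker cluster from the question (and keeping in mind that executors x cores-per-executor only equals workers x cores-per-worker when the executors cover every core):

numWorkerNodes = 4
numCpuCoresPerWorker = 4
rule_of_thumb = numWorkerNodes * numCpuCoresPerWorker   # 16 -> 1 partition per CPU core
heuristic = rule_of_thumb * 3                           # 48 -> ~3 partitions per core, as in the heuristic above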


* I just got bitten by this today: Prepare my bigdata with Spark via Python, where using too many partitions caused Active tasks is a negative number in Spark UI.
