How to Determine The Partition Size in an Apache Spark Dataframe
Problem Description
I have been using an excellent answer to a question posted on SE here to determine the number of partitions, and the distribution of partitions across a dataframe: Need to Know Partitioning Details in Dataframe Spark.
Can someone help me expand on that answer to determine the partition size of a dataframe?
Thanks.
Recommended Answer
Tuning the partition size is inevitably linked to tuning the number of partitions. There are at least three factors to consider here:
好"字样高度的并行性很重要,因此您可能希望拥有大量的分区,从而导致较小的分区大小.
A "good" high level of parallelism is important, so you may want to have a big number of partitions, resulting in a small partition size.
However, there is an upper bound on that number due to the third factor below, distribution overhead. Nevertheless, parallelism still ranks as priority #1, so if you have to err, err on the side of a high level of parallelism.
Generally, 2 to 4 tasks per core are recommended.
- The Spark documentation:

Typically you want 2-3 tasks per CPU core in your cluster.
- The book Spark in Action (by Petar Zečević) writes (page 74):
We recommend using three to four times as many partitions as there are cores in your cluster.
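The heuristics above can be sketched with a bit of arithmetic. This is a minimal illustration, not part of the quoted sources: the core count and data volume below are made-up example numbers, and `partition_stats` is a hypothetical helper that works on plain Python lists standing in for the partitions an RDD would hold.

```python
import sys

def suggested_partitions(total_cores, tasks_per_core=3):
    """Middle of the '2-4 tasks per core' range quoted above."""
    return total_cores * tasks_per_core

def estimated_partition_mb(total_data_mb, num_partitions):
    """Rough per-partition size, assuming data is evenly distributed."""
    return total_data_mb / num_partitions

# Hypothetical cluster: 16 cores total, 4800 MB of input data.
parts = suggested_partitions(16)            # 48 partitions
size = estimated_partition_mb(4800, parts)  # 100.0 MB per partition

def partition_stats(partitions):
    """Record count and approximate in-memory bytes per partition.

    `partitions` is a list of lists of rows, a stand-in for what
    df.rdd.glom().collect() would materialize in PySpark.
    """
    return [(len(p), sum(sys.getsizeof(row) for row in p))
            for p in partitions]
```

In actual PySpark you can get the partition count with `df.rdd.getNumPartitions()` and per-partition record counts with `df.rdd.glom().map(len).collect()`, without pulling the rows themselves back to the driver.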