How to Determine The Partition Size in an Apache Spark Dataframe


Question

I have been using an excellent answer to a question posted on SE here to determine the number of partitions and the distribution of partitions across a dataframe: Need to Know Partitioning Details in Dataframe Spark.
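For context, that kind of inspection might look roughly like the sketch below, assuming an existing SparkSession named spark and a DataFrame named df (both placeholder names):

```scala
// Sketch only: spark is an existing SparkSession, df an existing DataFrame.

// 1) Number of partitions backing the DataFrame
val numPartitions = df.rdd.getNumPartitions
println(s"Number of partitions: $numPartitions")

// 2) Distribution of rows across partitions (one record per partition index)
df.rdd
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()
  .foreach { case (idx, rows) => println(s"Partition $idx: $rows rows") }
```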

Can someone help me expand on that answer to determine the partition size of a dataframe?

Thanks

Answer

Tuning the partition size is inevitably linked to tuning the number of partitions. There are at least three factors to consider in this scope:

A "good" high level of parallelism is important, so you may want a large number of partitions, resulting in a small partition size.

However, there is an upper bound on that number due to the third point below, distribution overhead. Nevertheless, parallelism is still priority #1, so if you have to make a mistake, err on the side of a high level of parallelism.

Generally, 2 to 4 tasks per core are recommended.

Typically, we recommend 2-3 tasks per CPU core in your cluster.
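As a rough, illustrative sketch of applying that guideline (not part of the original answer; spark and df are an assumed existing SparkSession and DataFrame), you could derive a target partition count from the available cores and then check the resulting rows per partition:

```scala
// Sketch only: derive a target partition count from the 2-4 tasks-per-core guideline.
val totalCores       = spark.sparkContext.defaultParallelism // rough proxy for cores available to the app
val tasksPerCore     = 3                                      // middle of the recommended 2-4 range
val targetPartitions = totalCores * tasksPerCore

val repartitioned = df.repartition(targetPartitions)

// Check the resulting partition size in rows
val rowsPerPartition = repartitioned.rdd
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()

println(s"Partitions: ${rowsPerPartition.length}")
println(s"Rows per partition: min=${rowsPerPartition.map(_._2).min}, max=${rowsPerPartition.map(_._2).max}")
```

Note that defaultParallelism is only a rough proxy for the cores available to the application, and repartition triggers a full shuffle, so treat this as a starting point rather than a precise sizing rule.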
