How to Determine The Partition Size in an Apache Spark Dataframe
Problem Description
I have been using an excellent answer to a question posted on SE here to determine the number of partitions, and the distribution of partitions across a dataframe: Need to Know Partitioning Details in Dataframe Spark.
Can someone help me expand on that answer to determine the partition size of a dataframe?
Thanks.
Recommended Answer
Tuning the partition size is inevitably linked to tuning the number of partitions. There are at least three factors to consider here:
好"字样高度的并行性很重要,因此您可能希望拥有大量的分区,从而导致较小的分区大小.
A "good" high level of parallelism is important, so you may want to have a big number of partitions, resulting in a small partition size.
However, there is an upper bound on that number due to the third factor below, distribution overhead. Nevertheless, parallelism still ranks as priority #1, so if you have to err, err on the side of a high level of parallelism.
Generally, 2 to 4 tasks per core are recommended.
- The Spark documentation:

Typically you want 2-3 tasks per CPU core in your cluster.
- The book Spark in Action (by Petar Zečević) writes (page 74):
We recommend using three to four times as many partitions as there are cores in your cluster.
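The heuristics above can be sketched with a bit of arithmetic. This is a minimal illustration, not part of the quoted sources: the core count and data volume below are made-up example numbers, and `partition_stats` is a hypothetical helper that works on plain Python lists standing in for the partitions an RDD would hold.

```python
import sys

def suggested_partitions(total_cores, tasks_per_core=3):
    """Middle of the '2-4 tasks per core' range quoted above."""
    return total_cores * tasks_per_core

def estimated_partition_mb(total_data_mb, num_partitions):
    """Rough per-partition size, assuming data is evenly distributed."""
    return total_data_mb / num_partitions

# Hypothetical cluster: 16 cores total, 4800 MB of input data.
parts = suggested_partitions(16)            # 48 partitions
size = estimated_partition_mb(4800, parts)  # 100.0 MB per partition

def partition_stats(partitions):
    """Record count and approximate in-memory bytes per partition.

    `partitions` is a list of lists of rows, a stand-in for what
    df.rdd.glom().collect() would materialize in PySpark.
    """
    return [(len(p), sum(sys.getsizeof(row) for row in p))
            for p in partitions]
```

In actual PySpark you can get the partition count with `df.rdd.getNumPartitions()` and per-partition record counts with `df.rdd.glom().map(len).collect()`, without pulling the rows themselves back to the driver.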