Why do I get so many empty partitions when repartitioning a Spark Dataframe?


Problem description

I want to partition a dataframe "df1" on 3 columns. This dataframe has exactly 990 unique combinations of those 3 columns:

In [17]: df1.createOrReplaceTempView("df1_view")

In [18]: spark.sql("select count(*) from (select distinct(col1,col2,col3) from df1_view) as t").show()
+--------+                                                                      
|count(1)|
+--------+
|     990|
+--------+

To optimize the processing of this dataframe, I want to repartition df1 so that I get 990 partitions, one for each key possibility:

In [19]: df1.rdd.getNumPartitions()
Out[19]: 24

In [20]: df2 = df1.repartition(990, "col1", "col2", "col3")

In [21]: df2.rdd.getNumPartitions()
Out[21]: 990

I wrote a simple way to count rows in each partition:

In [22]: def f(iterator):
    ...:     # Count the rows in this partition; print() runs on the executor, not the driver
    ...:     count = 0
    ...:     for row in iterator:
    ...:         count += 1
    ...:     print(count)
    ...: 

In [23]: df2.foreachPartition(f)

What I actually get is 628 partitions containing one or more key values, and 362 empty partitions.
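
Since foreachPartition prints on the executors, a variant that collects the per-partition sizes back to the driver makes the 628/362 split easier to reproduce; a minimal sketch (the variable names are mine):

# Count the rows in every partition and bring the list of sizes back to the driver
sizes = df2.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()

empty = sum(1 for s in sizes if s == 0)
print(len(sizes), "partitions,", empty, "of them empty")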

I assumed Spark would repartition evenly (1 key value = 1 partition), but that does not seem to be the case, and I feel like this repartitioning is adding data skew even though it should be the other way around...

What algorithm does Spark use to partition a dataframe on columns? Is there a way to achieve what I thought was possible?

I'm using Spark 2.2.0 on Cloudera.

Recommended answer

To distribute data across partitions, Spark needs some way to convert the values of the columns into a partition index. There are two default partitioners in Spark: HashPartitioner and RangePartitioner. Different transformations in Spark can apply different partitioners, e.g. join applies the hash partitioner.

For the hash partitioner, the formula to convert a value to a partition index is basically value.hashCode() % numOfPartitions. In your case, multiple values map to the same partition index.
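
Why 990 distinct keys still collide can be seen with a toy simulation. The sketch below uses Python's built-in hash as a stand-in for the hash Spark applies to the columns (the actual function differs, so the exact counts won't match), with 990 made-up keys:

# 990 made-up (col1, col2, col3) combinations
keys = [("a", i, j) for i in range(33) for j in range(30)]

num_partitions = 990
occupied = {hash(k) % num_partitions for k in keys}

print(len(occupied))                   # noticeably fewer than 990 distinct indices
print(num_partitions - len(occupied))  # the remaining partitions stay empty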

You could implement your own partitioner if you want a better distribution. More about it here, here and here.
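
In PySpark, a custom partitioner can only be plugged in at the RDD level, via partitionBy on a pair RDD. A sketch of the idea for this case, assuming the column names from the question (the explicit key_to_index mapping is mine):

# Enumerate the 990 distinct keys and give each one its own partition index
distinct_keys = [tuple(r) for r in df1.select("col1", "col2", "col3").distinct().collect()]
key_to_index = {k: i for i, k in enumerate(distinct_keys)}

# Key the rows by (col1, col2, col3) and shuffle with the custom partition function,
# so every key lands in its own partition and none stays empty
pair_rdd = df1.rdd.keyBy(lambda row: (row["col1"], row["col2"], row["col3"]))
partitioned = pair_rdd.partitionBy(len(key_to_index), lambda key: key_to_index[key])

partitioned.values() then recovers the original rows; note that this works on the RDD, so the result is no longer a DataFrame.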
