Number of dataframe partitions after sorting?


Question

How does Spark determine the number of partitions after using an orderBy? I always thought that the resulting dataframe would have spark.sql.shuffle.partitions partitions, but this does not seem to be true:

val df = (1 to 10000).map(i => ("a",i)).toDF("n","i").repartition(10).cache

df.orderBy($"i").rdd.getNumPartitions // = 200 (=spark.sql.shuffle.partitions)
df.orderBy($"n").rdd.getNumPartitions // = 2 

In both cases, Spark performs +- Exchange rangepartitioning(i/n ASC NULLS FIRST, 200), so how can the resulting number of partitions in the second case be 2?

Answer

spark.sql.shuffle.partitions is used as an upper bound. The final number of partitions is 1 <= partitions <= spark.sql.shuffle.partitions.
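
For reference, you can inspect or change that bound on a live session. A minimal sketch, assuming an active SparkSession named spark:

spark.conf.get("spark.sql.shuffle.partitions")       // "200" by default
spark.conf.set("spark.sql.shuffle.partitions", "50") // lowers the bound for subsequent shuffles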

As you've mentioned, sorting in Spark goes through RangePartitioner. What it tries to achieve is to partition your dataset into a specified number (spark.sql.shuffle.partitions) of roughly equal ranges.

It is guaranteed that equal values end up in the same partition after the partitioning. It's worth checking the RangePartitioning (not part of the public API) class documentation:

...

All rows where the expressions in ordering evaluate to the same values will be in the same partition

And if the number of distinct ordering values is smaller than the desired number of partitions, i.e. the number of possible ranges is smaller than spark.sql.shuffle.partitions, you'll end up with a smaller number of partitions. Also, here's a quote from the RangePartitioner Scaladoc:

The actual number of partitions created by the RangePartitioner might not be the same as the partitions parameter, in the case where the number of sampled records is less than the value of partitions.
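
You can observe this by constructing a RangePartitioner by hand. A minimal sketch, assuming an active SparkContext named sc; the exact resulting counts are illustrative:

import org.apache.spark.RangePartitioner

// 10,000 distinct keys: enough distinct range bounds for all 200 requested partitions
val manyKeys = sc.parallelize(1 to 10000).map(i => (i, i))
new RangePartitioner(200, manyKeys).numPartitions // 200

// a single distinct key: the sampled range bounds collapse,
// so far fewer partitions are created than requested
val oneKey = sc.parallelize(1 to 10000).map(_ => ("a", 1))
new RangePartitioner(200, oneKey).numPartitions // much smaller, e.g. 1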

Going back to your example, n is a constant ("a") and cannot be meaningfully range-partitioned: there is only one possible range. i, on the other hand, can have 10,000 possible values and is partitioned into 200 (= spark.sql.shuffle.partitions) ranges or partitions.
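
You can confirm where the rows actually land by counting them per output partition. A sketch against the df from the question; the counts shown are illustrative:

df.orderBy($"n").rdd
  .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
  .collect()
// e.g. Array((0,0), (1,10000)) -- all rows fall into a single range,
// the other partition stays empty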

Note that this is only true for the DataFrame/Dataset API. When using an RDD's sortByKey, you can either specify the number of partitions explicitly, or Spark will use the current number of partitions.
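
A minimal sketch of that RDD behavior, assuming an active SparkContext named sc:

// pairs starts out with 10 partitions
val pairs = sc.parallelize(1 to 10000, numSlices = 10).map(i => (i, "x"))

pairs.sortByKey().getNumPartitions // 10: keeps the current number of partitions
pairs.sortByKey(ascending = true, numPartitions = 42).getNumPartitions // 42: explicit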
