Number of dataframe partitions after sorting?


Problem description

How does Spark determine the number of partitions after an orderBy? I always thought the resulting dataframe has spark.sql.shuffle.partitions partitions, but that does not seem to be true:

val df = (1 to 10000).map(i => ("a",i)).toDF("n","i").repartition(10).cache

df.orderBy($"i").rdd.getNumPartitions // = 200 (=spark.sql.shuffle.partitions)
df.orderBy($"n").rdd.getNumPartitions // = 2 

In both cases the physical plan shows +- Exchange rangepartitioning(i/n ASC NULLS FIRST, 200), so how can the resulting number of partitions in the second case be 2?
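
For reference, the exchange quoted above is visible in the physical plan printed by explain() (output abbreviated; column id suffixes omitted):

df.orderBy($"n").explain()
// the physical plan includes the step
// +- Exchange rangepartitioning(n ASC NULLS FIRST, 200)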

Answer

spark.sql.shuffle.partitions is used as an upper bound. The final number of partitions is 1 <= partitions <= spark.sql.shuffle.partitions.
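
To see the upper-bound behaviour directly, you can lower the setting and re-run the sort (a minimal sketch for spark-shell; the exact count can vary because the range bounds come from sampling):

spark.conf.set("spark.sql.shuffle.partitions", "50")
val sorted = (1 to 10000).map(i => ("a", i)).toDF("n", "i").orderBy($"i")
sorted.rdd.getNumPartitions // 50: "i" has plenty of distinct values, so the bound is reached
spark.conf.set("spark.sql.shuffle.partitions", "200") // restore the default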

As you've mentioned, sorting in Spark goes through RangePartitioner. What it tries to achieve is to partition your dataset into the specified number (spark.sql.shuffle.partitions) of roughly equal ranges.
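
The "roughly equal ranges" are easy to observe by counting the rows in each output partition (a sketch reusing df from the question; the sizes are only approximate because the bounds are derived from a sample):

val counts = df.orderBy($"i").rdd
  .mapPartitions(it => Iterator(it.size)) // one count per partition
  .collect()
// ~200 entries, each around 10000 / 200 = 50 rows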

There's a guarantee that equal values will be in the same partition after the partitioning. It's worth checking the documentation of the RangePartitioning class (not part of the public API):

...

All rows where the expressions in ordering evaluate to the same values will be in the same partition

And if the number of distinct ordering values is less than the desired number of partitions, i.e. the number of possible ranges is less than spark.sql.shuffle.partitions, you'll end up with fewer partitions. Also, here's a quote from the RangePartitioner Scaladoc:
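
To illustrate: with only three distinct sort keys there can only be a handful of ranges, no matter how large spark.sql.shuffle.partitions is (a sketch; the exact count depends on sampling):

val threeKeys = (1 to 10000).map(i => ("k" + i % 3, i)).toDF("n", "i")
threeKeys.orderBy($"n").rdd.getNumPartitions
// a small number on the order of the distinct key count, far below 200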

The actual number of partitions created by the RangePartitioner might not be the same as the partitions parameter, in the case where the number of sampled records is less than the value of partitions.

Going back to your example, n is a constant ("a") and cannot be split into more than one range. On the other hand, i has 10,000 possible values and is partitioned into 200 (= spark.sql.shuffle.partitions) ranges, i.e. partitions.

Note that this is only true for the DataFrame/Dataset API. When using an RDD's sortByKey, you can either specify the number of partitions explicitly, or Spark will use the current number of partitions.
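
For contrast, a minimal sketch of the RDD behaviour (sortByKey's numPartitions defaults to the parent RDD's partition count and can be set explicitly):

val pairs = spark.sparkContext.parallelize((1 to 10000).map(i => (i, "a")), 10)
pairs.sortByKey().getNumPartitions // 10: defaults to the current number of partitions
pairs.sortByKey(true, 4).getNumPartitions // 4: explicitly requested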

