How to sort within partitions (and avoid sorting across partitions) using the RDD API?
Question
By default, the Hadoop MapReduce shuffle sorts the shuffle key within each partition, but not across partitions (it is total ordering that makes keys sorted across the partitions).
How can I achieve the same thing with a Spark RDD (sort within each partition, but not across partitions)?
- RDD's sortByKey method performs a total ordering.
- RDD's repartitionAndSortWithinPartitions sorts within each partition but not across partitions; unfortunately, it adds an extra step to repartition the data.
Is there a direct way to sort within each partition but not across partitions?
Recommended answer
You can use a Dataset and its sortWithinPartitions method:
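import spark.implicits._  // enables toDF and the $"..." column syntax

// Two partitions: ("e", "d", "f") and ("b", "c", "a")
sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")  // sorts each partition independently
  .show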
+----+
|text|
+----+
|   d|
|   e|
|   f|
|   a|
|   b|
|   c|
+----+
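To check which partition each row ended up in, one option is to add a column with `spark_partition_id()` from `org.apache.spark.sql.functions`. The following is only a sketch: the object name, the application name, and the local master setting are assumptions made for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

// Hypothetical wrapper object, used here only to make the sketch self-contained.
object InspectPartitions {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session with two worker threads.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("inspect-partitions")
      .getOrCreate()
    import spark.implicits._

    val sorted = spark.sparkContext
      .parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
      .toDF("text")
      .sortWithinPartitions($"text")

    // Adds the id of the partition that holds each row, then prints the table.
    sorted.withColumn("partition", spark_partition_id()).show()

    spark.stop()
  }
}
```

Rows sharing a partition id should appear in sorted order relative to each other, while no ordering is implied between different partition ids.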
In general, the shuffle is an important factor in sorting partitions, because Spark can reuse the shuffle structures to sort without loading all of the data into memory at once.
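If you want to stay in the RDD API and avoid a shuffle entirely, one possible approach is to sort inside `mapPartitions`. This is a minimal sketch, not something the original answer proposes: the object name and the `sortPartition` helper are hypothetical, and the helper materializes each partition in memory before sorting, which is exactly the cost that the shuffle-based sort avoids. It is only reasonable when every partition fits comfortably in executor memory.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used here only to make the sketch self-contained.
object SortWithinPartitionsSketch {
  // Hypothetical helper: loads one partition into a Vector and sorts it in memory.
  def sortPartition[T: Ordering](it: Iterator[T]): Iterator[T] =
    it.toVector.sorted.iterator

  def main(args: Array[String]): Unit = {
    // Assumption: a local session with two worker threads.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sort-within-partitions-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

    // No shuffle: each partition is sorted where it already lives.
    val sortedWithin =
      rdd.mapPartitions(it => sortPartition(it), preservesPartitioning = true)

    // glom() turns each partition into an array so the result can be inspected.
    sortedWithin.glom().collect().foreach(p => println(p.mkString(", ")))

    spark.stop()
  }
}
```

For key-value RDDs that are already partitioned as needed, the same idea applies with a sort on the key, for example `it.toVector.sortBy(_._1).iterator` inside the `mapPartitions` call.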