How to sort within partitions (and avoid sort across the partitions) using RDD API?

Views: 270 | Published: 2020/9/4 3:15:14 | Tag: apache-spark


Question

By default, the Hadoop MapReduce shuffle sorts the shuffle keys within each partition but not across partitions (it is total ordering that makes the keys sorted across partitions).

How can I achieve the same thing with a Spark RDD, that is, sort within each partition but not across partitions?

RDD's sortByKey method performs a total ordering.
RDD's repartitionAndSortWithinPartitions sorts within each partition but not across partitions; unfortunately, it adds an extra repartition step.

Is there a direct way to sort within partitions but not across them?

Recommended answer

You can use a Dataset and its sortWithinPartitions method:

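import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

// Two partitions: ("e", "d", "f") and ("b", "c", "a")
sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")  // sorts each partition independently
  .show()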

+----+
|text|
+----+
|   d|
|   e|
|   f|
|   a|
|   b|
|   c|
+----+

In general, the shuffle is an important factor when sorting partitions, because Spark reuses the shuffle structures to sort without loading all the data into memory at once.
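If you need to stay on the RDD API, one possible approach is to sort each partition by hand with mapPartitions. The following is only a minimal sketch, not part of the original answer: the helper name sortEachPartition and the object name are made up for illustration, and the code assumes that every partition fits in executor memory, since toArray materializes the whole partition before sorting. That is exactly the memory cost the shuffle-based sort avoids.

```scala
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SortWithinPartitionsRdd {

  // Hypothetical helper: sorts every partition on its own, without a shuffle.
  // Assumption: a single partition fits in memory (toArray loads all of it).
  def sortEachPartition[T: Ordering: ClassTag](rdd: RDD[T]): RDD[T] =
    rdd.mapPartitions(
      iter => iter.toArray.sorted.iterator,
      preservesPartitioning = true // elements stay in their original partition
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; spark-shell already provides `spark` and `sc`.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sort-within-partitions-rdd")
      .getOrCreate()
    val sc = spark.sparkContext

    // Same data and partition count as the Dataset example above.
    val rdd = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

    // glom() turns each partition into an array so the layout can be inspected.
    sortEachPartition(rdd).glom().collect()
      .foreach(partition => println(partition.mkString(", ")))

    spark.stop()
  }
}
```

With the input above this should print d, e, f for the first partition and a, b, c for the second, matching the Dataset output. For partitions too large to hold in memory, the Dataset sortWithinPartitions method shown earlier is likely the safer choice.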
