How to sort within partitions (and avoid sorting across partitions) using the RDD API?


Question

By default, the Hadoop MapReduce shuffle sorts the shuffle keys within each partition, but not across partitions (sorting keys across partitions is what a total ordering does).

How can I achieve the same thing with a Spark RDD (sort within each partition, but not across partitions)?

  1. RDD's sortByKey method performs a total ordering.
  2. RDD's repartitionAndSortWithinPartitions sorts within each partition but not across partitions; unfortunately, it adds an extra repartitioning step.

Is there a direct way to sort within partitions but not across partitions?

Recommended answer

You can use the Dataset sortWithinPartitions method:

import spark.implicits._

sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")
  .show

+----+
|text|
+----+
|   d|
|   e|
|   f|
|   a|
|   b|
|   c|
+----+

In general, the shuffle is an important factor when sorting partitions, because the sort can reuse shuffle structures and so does not have to load all of the data into memory at once.
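
If staying in the RDD API without any shuffle is a hard requirement, one possible sketch is to sort each partition in place with mapPartitions. This is not part of the answer above. The helper sortPartition and the object name are made up for illustration, and the approach assumes every partition fits in executor memory, which is exactly the trade-off against the shuffle-based sort described above.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SortWithinPartitionsRdd {
  // Hypothetical helper: materializes one partition and sorts it in memory.
  // Assumption: the whole partition fits in executor memory.
  def sortPartition[T: Ordering](iter: Iterator[T]): Iterator[T] =
    iter.toVector.sorted.iterator

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-within-partitions-rdd")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Same data and partition count as in the answer above.
    val rdd: RDD[String] = sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)

    // mapPartitions is a narrow transformation: no shuffle, no repartitioning.
    val sorted = rdd.mapPartitions(iter => sortPartition(iter), preservesPartitioning = true)

    // glom() turns each partition into an array so the per-partition order is visible.
    // Prints "d, e, f" and then "a, b, c".
    sorted.glom().collect().foreach(p => println(p.mkString(", ")))

    spark.stop()
  }
}
```

Because each partition is sorted independently, the result carries no ordering guarantee across partitions, which matches what the question asks for.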
