How does Spark achieve sort order?


Question

Assume I have a list of Strings. I filter and sort them, and collect the result to the driver. However, things are distributed, and each RDD has its own part of the original list. So how does Spark achieve the final sorted order? Does it merge the results?

Answer

Sorting in Spark is a multi-phase process which requires shuffling:


  1. the input RDD is sampled, and this sample is used to compute the boundaries for each output partition (sample followed by collect);
  2. the input RDD is partitioned using a RangePartitioner with the boundaries computed in the first step (partitionBy);
  3. each partition from the second step is sorted locally (mapPartitions).

When the data is collected, all that is left is to follow the order defined by the partitioner.
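The three phases can be sketched on plain Scala collections, with no Spark required. This is only an illustration of the idea: the object name SortSketch, the boundary computation, and the partitionFor helper are made up for this sketch and are not Spark APIs (Spark's real RangePartitioner samples the input via reservoir sampling rather than sorting all of it):

```scala
// Illustrative sketch of range-partitioned sorting (not Spark code).
object SortSketch {
  def rangeSort(data: Seq[Int], numPartitions: Int): Seq[Int] = {
    // Phase 1: derive partition boundaries from the data.
    // (A real implementation would sort only a small sample.)
    val sample = data.sorted
    val step = math.max(1, sample.length / numPartitions)
    val boundaries = sample.grouped(step).map(_.head).toSeq.drop(1)

    // Phase 2: range-partition -- each element goes to the partition
    // whose value range contains it, so partition i holds only values
    // smaller than everything in partition i + 1.
    def partitionFor(x: Int): Int = boundaries.count(_ <= x)
    val partitions = data.groupBy(partitionFor).toSeq.sortBy(_._1).map(_._2)

    // Phase 3: sort each partition locally, then read partitions back
    // in partition order -- concatenation alone yields the total order,
    // so no merge step is needed at collect time.
    partitions.flatMap(_.sorted)
  }
}
```

Because the partitions cover disjoint, ordered value ranges, the driver never merges anything: collecting partition 0, then 1, then 2 already produces the globally sorted sequence.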

These steps are clearly reflected in the debug string:

scala> val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at ...

scala> rdd.sortBy(identity).toDebugString
res1: String = 
(6) MapPartitionsRDD[10] at sortBy at <console>:24 [] // Sort partitions
 |  ShuffledRDD[9] at sortBy at <console>:24 [] // Shuffle
 +-(8) MapPartitionsRDD[6] at sortBy at <console>:24 [] // Pre-shuffle steps
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [] // Parallelize

