如何排序在斯卡拉星火的RDD? [英] How to sort an RDD in Scala Spark?

查看:132
本文介绍了如何排序在斯卡拉星火的RDD?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

阅读星火方法sortByKey:

Reading Spark method sortByKey :

sortByKey([ascending], [numTasks])   When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

是否有可能返回的结果只是N的金额。因此,而不是返回所有的结果,只返回前10名,我可以排序的集合转换为数组,并使用方法,但因为这是一个O(N)操作有没有更有效的方法?

Is it possible to return just "N" amount of results. So instead of returning all results, just return the top 10. I could convert the sorted collection to an Array and use take method but since this is an O(N) operation is there a more efficient method ?

推荐答案

最有可能你已经细读源$ C ​​$ C:

Most likely you have already perused the source code:

  class OrderedRDDFunctions {
   // <snip>
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P] = {
    val part = new RangePartitioner(numPartitions, self, ascending)
    val shuffled = new ShuffledRDD[K, V, P](self, part)
    shuffled.mapPartitions(iter => {
      val buf = iter.toArray
      if (ascending) {
        buf.sortWith((x, y) => x._1 < y._1).iterator
      } else {
        buf.sortWith((x, y) => x._1 > y._1).iterator
      }
    }, preservesPartitioning = true)
  }

而且,正如你说的,在全部数据必须经过洗牌阶段 - 作为片断看到

And, as you say, the entire data must go through the shuffle stage - as seen in the snippet.

不过,你对随后调用取(K)关注可能不那么准确。该操作通过所有N个项目不循环:

However, your concern about subsequently invoking take(K) may not be so accurate. This operation does NOT cycle through all N items:

  /**
   * Take the first num elements of the RDD. It works by first scanning one partition, and use the
   * results from that partition to estimate the number of additional partitions needed to satisfy
   * the limit.
   */
  def take(num: Int): Array[T] = {

那么,这似乎:

O(myRdd.take(K))&LT;&LT; O(myRdd.sortByKey())〜= O(myRdd.sortByKey.take(K))
  (至少对于小K)下;&下; O(myRdd.sortByKey()。收集()

O(myRdd.take(K)) << O(myRdd.sortByKey()) ~= O(myRdd.sortByKey.take(k)) (at least for small K) << O(myRdd.sortByKey().collect()

这篇关于如何排序在斯卡拉星火的RDD?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆