如何排序在斯卡拉星火的RDD？ [英] How to sort an RDD in Scala Spark?

查看：132 发布时间：2016/5/22 15:14:06 scala apache-spark

本文介绍了如何排序在斯卡拉星火的RDD？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

阅读星火方法sortByKey：

Reading Spark method sortByKey :

sortByKey([ascending], [numTasks])   When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

是否有可能返回的结果只是N的金额。因此，而不是返回所有的结果，只返回前10名，我可以排序的集合转换为数组，并使用取方法，但因为这是一个O（N）操作有没有更有效的方法？

Is it possible to return just "N" amount of results. So instead of returning all results, just return the top 10. I could convert the sorted collection to an Array and use take method but since this is an O(N) operation is there a more efficient method ?

推荐答案

最有可能你已经细读源$ C $ C：

Most likely you have already perused the source code:

  class OrderedRDDFunctions {
   // <snip>
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P] = {
    val part = new RangePartitioner(numPartitions, self, ascending)
    val shuffled = new ShuffledRDD[K, V, P](self, part)
    shuffled.mapPartitions(iter => {
      val buf = iter.toArray
      if (ascending) {
        buf.sortWith((x, y) => x._1 < y._1).iterator
      } else {
        buf.sortWith((x, y) => x._1 > y._1).iterator
      }
    }, preservesPartitioning = true)
  }

而且，正如你说的，在全部数据必须经过洗牌阶段 - 作为片断看到

And, as you say, the entire data must go through the shuffle stage - as seen in the snippet.

不过，你对随后调用取（K）关注可能不那么准确。该操作通过所有N个项目不循环：

However, your concern about subsequently invoking take(K) may not be so accurate. This operation does NOT cycle through all N items:

  /**
   * Take the first num elements of the RDD. It works by first scanning one partition, and use the
   * results from that partition to estimate the number of additional partitions needed to satisfy
   * the limit.
   */
  def take(num: Int): Array[T] = {

那么，这似乎：

O（myRdd.take（K））＆LT;＆LT; O（myRdd.sortByKey（））〜= O（myRdd.sortByKey.take（K））
（至少对于小K）下;＆下; O（myRdd.sortByKey（）。收集（）

O(myRdd.take(K)) << O(myRdd.sortByKey()) ~= O(myRdd.sortByKey.take(k)) (at least for small K) << O(myRdd.sortByKey().collect()

这篇关于如何排序在斯卡拉星火的RDD？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何排序在斯卡拉星火的RDD？ [英] How to sort an RDD in Scala Spark?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

如何排序在斯卡拉星火的RDD？ [英] How to sort an RDD in Scala Spark?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭