如何排序在斯卡拉星火的RDD? [英] How to sort an RDD in Scala Spark?
问题描述
阅读星火方法sortByKey:
Reading Spark method sortByKey :
sortByKey([ascending], [numTasks]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
是否有可能返回的结果只是N的金额。因此,而不是返回所有的结果,只返回前10名,我可以排序的集合转换为数组,并使用取
方法,但因为这是一个O(N)操作有没有更有效的方法?
Is it possible to return just "N" amount of results. So instead of returning all results, just return the top 10. I could convert the sorted collection to an Array and use take
method but since this is an O(N) operation is there a more efficient method ?
推荐答案
最有可能你已经细读源$ C $ C:
Most likely you have already perused the source code:
class OrderedRDDFunctions {
// <snip>
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P] = {
val part = new RangePartitioner(numPartitions, self, ascending)
val shuffled = new ShuffledRDD[K, V, P](self, part)
shuffled.mapPartitions(iter => {
val buf = iter.toArray
if (ascending) {
buf.sortWith((x, y) => x._1 < y._1).iterator
} else {
buf.sortWith((x, y) => x._1 > y._1).iterator
}
}, preservesPartitioning = true)
}
而且,正如你说的,在全部数据必须经过洗牌阶段 - 作为片断看到
And, as you say, the entire data must go through the shuffle stage - as seen in the snippet.
不过,你对随后调用取(K)关注可能不那么准确。该操作通过所有N个项目不循环:
However, your concern about subsequently invoking take(K) may not be so accurate. This operation does NOT cycle through all N items:
/**
* Take the first num elements of the RDD. It works by first scanning one partition, and use the
* results from that partition to estimate the number of additional partitions needed to satisfy
* the limit.
*/
def take(num: Int): Array[T] = {
那么,这似乎:
O(myRdd.take(K))&LT;&LT; O(myRdd.sortByKey())〜= O(myRdd.sortByKey.take(K))
(至少对于小K)下;&下; O(myRdd.sortByKey()。收集()
O(myRdd.take(K)) << O(myRdd.sortByKey()) ~= O(myRdd.sortByKey.take(k)) (at least for small K) << O(myRdd.sortByKey().collect()
这篇关于如何排序在斯卡拉星火的RDD?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!