Sorting by multiple fields in Apache Spark
Question
I have an RDD in Spark. Each element of the RDD is a list. Moreover, all the elements are lists with the same layout, so it's rather like a table. I need the RDD sorted by some columns, in a specific priority order.
How can I achieve this?
PS: Here is what I tried:
I tried to sort by the field with the highest priority, then group by it, then sort each group by the field with the second-highest priority. I did this recursively and joined the results. But calling RDD.groupBy so many times made it very slow.
Recommended answer
If you simply want to sort in ascending or descending order, you need two pieces to make it work:
- the RDD.sortBy function, which "sorts (...) RDD by the given keyfunc"
- the knowledge that Python lists and tuples are compared lexicographically:
>>> (1, 2) < (3, 4)
True
>>> (5, 6) < (3, 4)
False
>>> ("foo", 1) < ("foo", 2, 5)
True
>>> ("bar", 1, 2) > ("bar", 1)
True
Simply combine these two in something like rdd.sortBy(lambda x: (x[0], x[3])) and you're good to go.
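As a minimal sketch of how such a tuple key behaves, the snippet below uses plain Python `sorted` as a local stand-in for `rdd.sortBy`, since both take the same kind of key function. The sample rows and the name `key_func` are made up for illustration, and `sc` in the comment is an assumed, already-running SparkContext.

```python
# Rows shaped like the RDD elements in the question: each one is a list.
rows = [
    ["b", 10, "x", 2],
    ["a", 30, "y", 9],
    ["b", 20, "z", 1],
    ["a", 40, "w", 3],
]


def key_func(x):
    # Column 0 has the highest priority; column 3 breaks ties.
    return (x[0], x[3])


# Tuples compare lexicographically, so one pass handles both columns.
by_cols = sorted(rows, key=key_func)

# With a live SparkContext `sc` (an assumption here), the same key
# function would be passed to sortBy:
#   sc.parallelize(rows).sortBy(key_func).collect()
```

A single sort with a compound key avoids the repeated groupBy and join steps described in the question.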
If you need mixed ordering (descending by some values, ascending by others) on non-numeric values, you can either embed this logic inside keyfunc or convert the RDD to a DataFrame and use the orderBy method with desc:
from pyspark.sql.functions import desc

df.orderBy(desc("foo"), "bar")
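For the other option, embedding the mixed ordering in the key function, one possible approach is a small wrapper class that inverts comparisons. This is a sketch, not something from the original answer: the class name `Descending` and the sample rows are hypothetical, and plain `sorted` again stands in for `rdd.sortBy`. Whether the wrapper serializes cleanly to Spark executors has not been verified here.

```python
from functools import total_ordering


@total_ordering
class Descending:
    """Hypothetical helper: wraps a value so that it sorts in reverse."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        # Inverted on purpose: a larger wrapped value counts as "smaller".
        return self.value > other.value


rows = [["b", 1], ["a", 2], ["b", 0], ["a", 1]]

# Column 0 descending (strings), column 1 ascending.
mixed = sorted(rows, key=lambda x: (Descending(x[0]), x[1]))
```

For numeric columns a wrapper is unnecessary, because negating the value in the key (for example `-x[1]`) gives descending order directly.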