Sorting by multiple fields in Apache Spark


Question

I have an RDD in Spark. Each element of the RDD is a list. Moreover, all the elements are lists of a similar pattern, so it's kind of like a table. I need the RDD sorted by some columns, in a specific priority order.

How can I achieve this?

PS: This is what I tried

I tried to sort by the field with the highest priority, then group by it, then sort each group by the field with the second-highest priority. I did this recursively and joined the results. But using RDD.groupBy so many times made it very, very slow.

Recommended answer

If you simply want to sort in ascending / descending order, there are two pieces you need to make it work:


  • the RDD.sortBy function, which "sorts (...) RDD by the given keyfunc"
  • the knowledge that Python lists and tuples are compared lexicographically:

>>> (1, 2) < (3, 4)
True
>>> (5, 6) < (3, 4)
False
>>> ("foo", 1) < ("foo", 2, 5)
True
>>> ("bar", 1, 2) > ("bar", 1)
True


Simply combine these two in something like rdd.sortBy(lambda x: (x[0], x[3])) and you're good to go.
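As a rough illustration of the idea above, the sketch below builds a tuple key from two columns and checks it locally with Python's built-in sorted(), since the keyfunc passed to sortBy is an ordinary Python function. The rows, column meanings, and the name by_city_then_score are made up for this example; they are not from the original question.

```python
# Hypothetical rows: every element is a list with the same layout,
# [city, name, age, score].
rows = [
    ["Oslo", "Ann", 31, 7.5],
    ["Bergen", "Bob", 25, 9.0],
    ["Oslo", "Cid", 28, 7.5],
    ["Bergen", "Dan", 25, 6.0],
]


def by_city_then_score(row):
    # Highest-priority column first; ties are broken by the next element,
    # because tuples are compared lexicographically.
    return (row[0], row[3])


# Local check with sorted(). On an RDD the corresponding call would be
# rdd.sortBy(by_city_then_score).
result = sorted(rows, key=by_city_then_score)
names = [row[1] for row in result]
print(names)
```

For a numeric column that should be descending, negating it inside the key, for example (row[0], -row[3]), is usually enough.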

If you need mixed ordering (descending by some values, ascending by others) on non-numeric values, you can either embed this logic inside keyfunc or convert the RDD to a DataFrame and use the orderBy method with desc:

from pyspark.sql.functions import desc
df.orderBy(desc("foo"), "bar")
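For the first option, embedding the logic inside keyfunc, one possible approach is a small wrapper that reverses comparisons for a single key component. This is only a sketch: the Descending class and the sample rows are hypothetical, and it is checked here with sorted() rather than on a cluster. Using it with rdd.sortBy assumes the class can be pickled and shipped to the workers.

```python
from functools import total_ordering


@total_ordering
class Descending:
    """Hypothetical wrapper that reverses the ordering of the wrapped value."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        # Reversed on purpose: a "smaller" key means a larger wrapped value.
        return self.value > other.value


# Hypothetical rows: [label, rank].
rows = [
    ["b", 2],
    ["a", 1],
    ["b", 1],
    ["a", 2],
]


def mixed_key(row):
    # Column 0 descending (non-numeric), column 1 ascending.
    return (Descending(row[0]), row[1])


mixed = sorted(rows, key=mixed_key)
print(mixed)
```

The DataFrame route avoids the custom class entirely, which is likely the simpler choice when the data already has a tabular shape.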
