PySpark takeOrdered Multiple Fields (Ascending and Descending)


Problem description


The takeOrdered method from pyspark.RDD gets the N elements from an RDD, ordered ascending or as specified by the optional key function, as described here: pyspark.RDD.takeOrdered. The documentation shows the following example with one key:

>>> sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7], 2).takeOrdered(6, key=lambda x: -x)
[10, 9, 7, 6, 5, 4]


Is it also possible to define more keys e.g. x,y,z for data that has 3 columns?


The keys should use different orders, such as x = asc, y = desc, z = asc. That means that if the first values x of two rows are equal, the second values y should be compared in descending order.

Recommended answer


For numeric fields you could write:

n = 1
rdd = sc.parallelize([
    (-1, 99, 1), (-1, -99, -1), (5, 3, 8), (-1, 99, -1)
])

# x ascending, y descending (negate y), z ascending
rdd.takeOrdered(n, lambda x: (x[0], -x[1], x[2]))
# [(-1, 99, -1)]
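The tuple key works because Python compares tuples element by element, and negating a numeric field flips its sort direction. A quick plain-Python check of the same key, using the built-in sorted() so it runs without Spark:

```python
# Same data and composite key as the takeOrdered example above:
# x ascending, y descending (negated), z ascending.
data = [(-1, 99, 1), (-1, -99, -1), (5, 3, 8), (-1, 99, -1)]

ordered = sorted(data, key=lambda t: (t[0], -t[1], t[2]))
print(ordered[0])  # (-1, 99, -1) -- matches takeOrdered(1, ...)
```

Note that the negation trick only works for numeric fields; for strings or other objects you need the rich-comparison approach below.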


For other objects you can define a record type with its own set of rich comparison methods:

class XYZ(object):
    __slots__ = ["x", "y", "z"]

    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __eq__(self, other):
        if not isinstance(other, XYZ):
            return False
        return self.x == other.x and self.y == other.y and self.z == other.z

    def __lt__(self, other):
        if not isinstance(other, XYZ):
            raise TypeError(
                "'<' not supported between instances of 'XYZ' and '{0}'".format(
                    type(other).__name__
                ))
        if self.x == other.x:
            if self.y == other.y:
                return self.z < other.z
            else:
                return self.y > other.y
        else:
            return self.x < other.x

    def __repr__(self):
        return "XYZ({}, {}, {})".format(self.x, self.y, self.z)

    @classmethod
    def from_tuple(cls, xyz):
        x, y, z = xyz
        return cls(x, y, z)

Then:

from xyz import XYZ  # XYZ must live in a module (here xyz.py) so Spark workers can unpickle it

rdd.map(XYZ.from_tuple).takeOrdered(n)
# [XYZ(-1, 99, -1)]
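Much of the comparison boilerplate above can be trimmed with the standard-library functools.total_ordering decorator, which derives the remaining comparison methods from __eq__ and __lt__. A minimal sketch under that assumption (the class name XYZKey and the swapped-operand tuple trick in __lt__ are my own, not from the original answer):

```python
from functools import total_ordering


@total_ordering
class XYZKey(object):
    """Orders records by x asc, y desc, z asc (hypothetical variant of XYZ)."""
    __slots__ = ["x", "y", "z"]

    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __eq__(self, other):
        return (self.x, self.y, self.z) == (other.x, other.y, other.z)

    def __lt__(self, other):
        # Swapping self.y and other.y reverses the order on y alone,
        # without negation, so it also works for non-numeric fields.
        return (self.x, other.y, self.z) < (other.x, self.y, other.z)


records = [XYZKey(-1, 99, 1), XYZKey(-1, -99, -1),
           XYZKey(5, 3, 8), XYZKey(-1, 99, -1)]
first = min(records)
print(first.x, first.y, first.z)  # -1 99 -1
```

The same class could be used with rdd.map(...).takeOrdered(n), subject to the same pickling caveat: define it in an importable module, not in the driver script's __main__.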

In practice, though, just use SQL:

from pyspark.sql.functions import asc, desc

rdd.toDF(["x", "y", "z"]).orderBy(asc("x"), desc("y"), asc("z")).take(n)
# [Row(x=-1, y=99, z=-1)]

