Get Top 3 values for every key in a RDD in Spark


Question

I'm a beginner with Spark and I am trying to create an RDD that contains the top 3 values for every key (not just the top 3 values overall). My current RDD contains thousands of entries in the following format:

(key, String, value)

So imagine I had an RDD with content like this:

[("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9),
("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)]

I can currently display the top 3 values in the RDD like so:

("K1", "ddd", 9)
("B1", "iop", 8)
("B1", "rty", 7)

Using:

top3RDD = rdd.takeOrdered(3, key=lambda x: -x[2])
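Since `takeOrdered` imposes a single global ordering on the RDD, it behaves like running `heapq.nlargest` over the entire dataset at once, which is why only three rows come back in total. A minimal plain-Python sketch of that behavior, using the sample data above (no Spark needed):

```python
from heapq import nlargest

data = [("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9),
        ("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)]

# Global top 3 by the numeric field, ignoring the keys entirely
top3_global = nlargest(3, data, key=lambda x: x[2])
print(top3_global)  # [('K1', 'ddd', 9), ('B1', 'iop', 8), ('B1', 'rty', 7)]
```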

Instead, what I want is to gather the top 3 values for every key in the RDD, so I would like to return this instead:

("K1", "ddd", 9)
("K1", "aaa", 6)
("K1", "bbb", 3)
("B1", "iop", 8)
("B1", "rty", 7)
("B1", "qwe", 4)

Answer

You need to groupBy the key, and then you can use heapq.nlargest to take the top 3 values from each group:

from heapq import nlargest

rdd.groupBy(
    lambda x: x[0]          # group records by their key
).flatMap(
    # keep the 3 rows with the largest numeric field in each group
    lambda g: nlargest(3, g[1], key=lambda x: x[2])
).collect()

[('B1', 'iop', 8), 
 ('B1', 'rty', 7), 
 ('B1', 'qwe', 4), 
 ('K1', 'ddd', 9), 
 ('K1', 'aaa', 6), 
 ('K1', 'bbb', 3)]
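The groupBy-then-nlargest logic can be checked without a cluster. This plain-Python sketch reproduces it on the sample data, with `itertools.groupby` standing in for Spark's `groupBy`:

```python
from heapq import nlargest
from itertools import groupby

data = [("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9),
        ("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)]

# itertools.groupby needs its input sorted by the grouping key first
grouped = groupby(sorted(data, key=lambda x: x[0]), key=lambda x: x[0])

# Keep the 3 rows with the largest numeric field from each group
top3_per_key = [row for _, rows in grouped
                for row in nlargest(3, rows, key=lambda x: x[2])]
print(top3_per_key)
# [('B1', 'iop', 8), ('B1', 'rty', 7), ('B1', 'qwe', 4),
#  ('K1', 'ddd', 9), ('K1', 'aaa', 6), ('K1', 'bbb', 3)]
```

One caveat on the Spark version: groupBy materializes every value for a key on one executor, so for heavily skewed keys an aggregation that keeps only a bounded heap per key (e.g. via aggregateByKey) scales better; for modest group sizes the groupBy approach above is fine.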

