Get Top 3 values for every key in an RDD in Spark
Question
I'm a beginner with Spark and I am trying to create an RDD that contains the top 3 values for every key (not just the top 3 values overall). My current RDD contains thousands of entries in the following format:
(key, String, value)
So imagine I had an RDD with content like this:
[("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9),
("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)]
I can currently display the top 3 values in the RDD like so:
("K1", "ddd", 9)
("B1", "iop", 8)
("B1", "rty", 7)
Using:

top3 = rdd.takeOrdered(3, key=lambda x: -x[2])  # negate so the largest values come first; returns a list, not an RDD
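Outside of Spark, the same global top-3 selection can be reproduced with `heapq.nlargest` on plain Python data (sample rows copied from above), which is a useful sanity check before moving to the per-key version:

```python
from heapq import nlargest

rows = [("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9),
        ("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)]

# Global top 3 by the numeric field, largest first -- the same three rows
# takeOrdered returns when its key is negated
top3 = nlargest(3, rows, key=lambda x: x[2])
print(top3)  # [('K1', 'ddd', 9), ('B1', 'iop', 8), ('B1', 'rty', 7)]
```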
Instead, what I want is to gather the top 3 values for every key in the RDD, so I would like to return this instead:
("K1", "ddd", 9)
("K1", "aaa", 6)
("K1", "bbb", 3)
("B1", "iop", 8)
("B1", "rty", 7)
("B1", "qwe", 4)
Recommended Answer
You need to groupBy the key, and then you can use heapq.nlargest to take the top 3 values from each group:
from heapq import nlargest

rdd.groupBy(
    lambda x: x[0]                                    # group by the key (first field)
).flatMap(
    lambda g: nlargest(3, g[1], key=lambda x: x[2])   # top 3 rows per group by value
).collect()
[('B1', 'iop', 8),
('B1', 'rty', 7),
('B1', 'qwe', 4),
('K1', 'ddd', 9),
('K1', 'aaa', 6),
('K1', 'bbb', 3)]
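One caveat: `groupBy` materializes every value for a key on one executor, which can be expensive for skewed keys. A bounded alternative is `aggregateByKey`, which only ever keeps the current top-N per key. Below is a minimal sketch; the helper names `add_row` and `merge_accs` are my own, `N` is the cutoff, and `rdd` is assumed to be the (key, string, value) RDD from the question:

```python
from heapq import nlargest

N = 3  # how many rows to keep per key

def add_row(acc, row):
    # acc: list of at most N (string, value) pairs; fold in one more row
    return nlargest(N, acc + [row], key=lambda x: x[1])

def merge_accs(a, b):
    # combine two partial top-N lists from different partitions
    return nlargest(N, a + b, key=lambda x: x[1])

# Sketch of the Spark pipeline (assumes `rdd` exists):
# top3_by_key = (rdd
#     .map(lambda x: (x[0], (x[1], x[2])))       # key by the first field
#     .aggregateByKey([], add_row, merge_accs)   # bounded per-key top N
#     .flatMap(lambda kv: [(kv[0], s, v) for s, v in kv[1]])
# )
```

Because each accumulator never grows past N entries, the shuffle moves at most N rows per key per partition instead of whole groups.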