Get Top 3 values for every key in a RDD in Spark


Question

I'm a beginner with Spark and I am trying to create an RDD that contains the top 3 values for every key (not just the top 3 values overall). My current RDD contains thousands of entries in the following format:

(key, String, value)

So imagine I had an RDD with content like this:

[("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9),
("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)]

I can currently display the top 3 values in the RDD like so:

("K1", "ddd", 9)
("B1", "iop", 8)
("B1", "rty", 7)

Using:

top3RDD = rdd.takeOrdered(3, key=lambda x: -x[2])
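Since `takeOrdered` imposes a single global ordering on the RDD, it behaves like running `heapq.nlargest` over the entire dataset at once, which is why only three rows come back in total. A minimal plain-Python sketch of that behavior, using the sample data above (no Spark needed):

```python
from heapq import nlargest

data = [("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9),
        ("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)]

# Global top 3 by the numeric field, ignoring the keys entirely
top3_global = nlargest(3, data, key=lambda x: x[2])
print(top3_global)  # [('K1', 'ddd', 9), ('B1', 'iop', 8), ('B1', 'rty', 7)]
```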

Instead, what I want is to gather the top 3 values for every key in the RDD, so I would like to return this instead:

("K1", "ddd", 9)
("K1", "aaa", 6)
("K1", "bbb", 3)
("B1", "iop", 8)
("B1", "rty", 7)
("B1", "qwe", 4)

Answer

You need to groupBy the key, and then you can use heapq.nlargest to take the top 3 values from each group:

from heapq import nlargest

rdd.groupBy(
    lambda x: x[0]          # group records by their key
).flatMap(
    # keep the 3 rows with the largest numeric field in each group
    lambda g: nlargest(3, g[1], key=lambda x: x[2])
).collect()

[('B1', 'iop', 8), 
 ('B1', 'rty', 7), 
 ('B1', 'qwe', 4), 
 ('K1', 'ddd', 9), 
 ('K1', 'aaa', 6), 
 ('K1', 'bbb', 3)]
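The groupBy-then-nlargest logic can be checked without a cluster. This plain-Python sketch reproduces it on the sample data, with `itertools.groupby` standing in for Spark's `groupBy`:

```python
from heapq import nlargest
from itertools import groupby

data = [("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9),
        ("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)]

# itertools.groupby needs its input sorted by the grouping key first
grouped = groupby(sorted(data, key=lambda x: x[0]), key=lambda x: x[0])

# Keep the 3 rows with the largest numeric field from each group
top3_per_key = [row for _, rows in grouped
                for row in nlargest(3, rows, key=lambda x: x[2])]
print(top3_per_key)
# [('B1', 'iop', 8), ('B1', 'rty', 7), ('B1', 'qwe', 4),
#  ('K1', 'ddd', 9), ('K1', 'aaa', 6), ('K1', 'bbb', 3)]
```

One caveat on the Spark version: groupBy materializes every value for a key on one executor, so for heavily skewed keys an aggregation that keeps only a bounded heap per key (e.g. via aggregateByKey) scales better; for modest group sizes the groupBy approach above is fine.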

