Retrieve top n in each group of a DataFrame in pyspark
Question
There's a DataFrame in pyspark with data as below:
user_id object_id score
user_1 object_1 3
user_1 object_1 1
user_1 object_2 2
user_2 object_1 5
user_2 object_2 2
user_2 object_2 6
What I expect is to return the 2 records with the highest score in each group of records sharing the same user_id. Consequently, the result should look like the following:
user_id object_id score
user_1 object_1 3
user_1 object_2 2
user_2 object_2 6
user_2 object_1 5
I'm really new to pyspark; could anyone give me a code snippet or point me to the relevant documentation for this problem? Many thanks!
Answer
I believe you need to use window functions to attain the rank of each row based on user_id and score, and subsequently filter your results to keep only the first two values.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col

# Rank rows within each user_id partition by descending score
window = Window.partitionBy(df['user_id']).orderBy(df['score'].desc())

# Wrap the chain in parentheses so the method calls can span multiple lines
(df.select('*', rank().over(window).alias('rank'))
   .filter(col('rank') <= 2)
   .show())
#+-------+---------+-----+----+
#|user_id|object_id|score|rank|
#+-------+---------+-----+----+
#| user_1| object_1| 3| 1|
#| user_1| object_2| 2| 2|
#| user_2| object_2| 6| 1|
#| user_2| object_1| 5| 2|
#+-------+---------+-----+----+
In general, the official programming guide is a good place to start learning Spark.
rdd = sc.parallelize([("user_1", "object_1", 3),
                      ("user_1", "object_2", 2),
                      ("user_2", "object_1", 5),
                      ("user_2", "object_2", 2),
                      ("user_2", "object_2", 6)])
df = sqlContext.createDataFrame(rdd, ["user_id", "object_id", "score"])