PySpark - Add a new column with a Rank by User

Views: 77 Published: 2020/9/4 6:25:41 Tags: python apache-spark pyspark apache-spark-sql pyspark-sql

Question

I have this PySpark DataFrame:

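import numpy as np
import pandas as pd

# Note: np.array stores these mixed values as strings
df = pd.DataFrame(np.array([
    ["aa@gmail.com", 2, 3], ["aa@gmail.com", 5, 5],
    ["bb@gmail.com", 8, 2], ["cc@gmail.com", 9, 3]
]), columns=['user', 'movie', 'rating'])

# sqlContext is assumed to be an existing SQLContext (e.g. from the PySpark shell)
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)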

         user movie rating
aa@gmail.com     2      3
aa@gmail.com     5      5
bb@gmail.com     8      2
cc@gmail.com     9      3

I need to add a new column with a rank by user.

I want this output:

         user  movie rating  Rank
aa@gmail.com     2      3     1
aa@gmail.com     5      5     1
bb@gmail.com     8      2     2
cc@gmail.com     9      3     3

How can I do this?

Recommended answer

There is really no elegant solution here as of now. If you have to, you can try something like this:

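from pyspark.sql.functions import col

# Build a (user, rank) lookup: distinct users, sorted, numbered from 0
lookup = (sparkdf.select("user")
    .distinct()
    .orderBy("user")
    .rdd
    .zipWithIndex()
    .map(lambda x: x[0] + (x[1], ))  # append the index to the Row tuple
    .toDF(["user", "rank"]))

# Join the ranks back onto the data and shift them to start at 1
sparkdf.join(lookup, ["user"]).withColumn("rank", col("rank") + 1)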
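The steps of this pipeline can be traced in plain Python to see what each one produces. This is only an illustrative sketch on the four sample rows, not Spark code: the `rows` list and the `add_user_rank` helper are hypothetical names, and `enumerate` stands in for `zipWithIndex`.

```python
# Plain-Python model of the lookup-and-join approach (no Spark required).
# `rows` and `add_user_rank` are hypothetical names used only for illustration.
rows = [
    ("aa@gmail.com", 2, 3),
    ("aa@gmail.com", 5, 5),
    ("bb@gmail.com", 8, 2),
    ("cc@gmail.com", 9, 3),
]


def add_user_rank(rows):
    # Mirrors select("user").distinct().orderBy("user")
    users = sorted({user for user, _, _ in rows})
    # zipWithIndex() assigns 0-based positions; adding 1 gives a 1-based rank
    lookup = {user: index + 1 for index, user in enumerate(users)}
    # Mirrors join(lookup, ["user"]): every row receives its user's rank
    return [(user, movie, rating, lookup[user]) for user, movie, rating in rows]


ranked = add_user_rank(rows)
for row in ranked:
    print(row)
```

Because the lookup holds one entry per distinct user, it is usually much smaller than the full data, which is what makes the join-based version workable.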

A window-function alternative is much more concise:

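from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window

w = Window.orderBy("user")  # assumed definition: the original snippet leaves w undefined
sparkdf.withColumn("rank", dense_rank().over(w))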

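However, a global ranking needs a window ordered over the whole DataFrame with no partitioning, so Spark moves all rows into a single partition. This is extremely inefficient and should be avoided in practice.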
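To see why the window version uses `dense_rank` rather than `rank` or `row_number`, the three can be modelled in plain Python on the sample users. This is a rough sketch of their semantics over rows already ordered by user, not Spark's implementation; the function names simply mirror the Spark ones and the `users` list is made up from the sample data.

```python
# Plain-Python model of three ranking functions over rows ordered by user.
# These helpers are illustrative stand-ins, not the Spark functions themselves.
users = ["aa@gmail.com", "aa@gmail.com", "bb@gmail.com", "cc@gmail.com"]


def row_number(keys):
    # Every row gets its own consecutive number, ties included.
    return list(range(1, len(keys) + 1))


def rank(keys):
    # Ties share the position of their first row, leaving gaps afterwards.
    first_position = {}
    for position, key in enumerate(keys, start=1):
        first_position.setdefault(key, position)
    return [first_position[key] for key in keys]


def dense_rank(keys):
    # Ties share a rank and the next distinct key gets the next integer.
    order = {key: i for i, key in enumerate(sorted(set(keys)), start=1)}
    return [order[key] for key in keys]


print(row_number(users))  # [1, 2, 3, 4]
print(rank(users))        # [1, 1, 3, 4]
print(dense_rank(users))  # [1, 1, 2, 3]
```

Only the `dense_rank` result matches the desired Rank column, where both rows of the first user share rank 1 and the next user gets rank 2.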
