Pyspark - GroupBy and Count combined with a WHERE

Views: 253 Published: 2021/4/8 20:07:46 Tags: pandas python-2.7 apache-spark group-by pyspark

Question

Say I have a list of magazine subscriptions, like so:

subscription_id    user_id       created_at
 12384               1           2018-08-10
 83294               1           2018-06-03
 98234               1           2018-04-08
 24903               2           2018-05-08
 32843               2           2018-04-06
 09283               2           2018-03-07

Now I want to add a column that states how many previous subscriptions a user had, before this current subscription. For example, if this is the user's first subscription, the new column's value should be 0. If they had one subscription starting before this subscription, the new column's value should be 1. Here is the full desired output:

subscription_id    user_id       created_at        users_previous_subs
 12384               1           2018-08-10                  2
 83294               1           2018-06-03                  1
 98234               1           2018-04-08                  0
 24903               2           2018-05-08                  2
 32843               2           2018-04-06                  1
 09283               2           2018-03-07                  0

How can I accomplish this, preferably in PySpark (and therefore without using shift)?

Let me know if this is not clear. Thanks!!

Recommended answer

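This boils down to computing a row number: within each user's rows ordered by created_at, the row number minus 1 is the number of earlier subscriptions.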

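from pyspark.sql import Window
from pyspark.sql import functions as func

# Define a window: one partition per user, rows ordered by subscription date
w = Window.partitionBy(df.user_id).orderBy(df.created_at)
# Add a column with the zero-based row number within each user's window.
# withColumn returns a new DataFrame, so the result must be assigned.
df = df.withColumn('prev_subs', func.row_number().over(w) - 1)
df.show()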
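
To see what the window computes without a Spark session, the same logic can be sketched in plain Python: group the rows by user_id, sort each group by created_at, and use each row's zero-based position as the count of earlier subscriptions. The helper name add_prev_subs and the tuple layout are assumptions made for this illustration, not part of the original answer; the rows are the ones shown in the question's desired output.

```python
from collections import defaultdict

# (subscription_id, user_id, created_at) rows from the question's desired output
rows = [
    ("12384", 1, "2018-08-10"),
    ("83294", 1, "2018-06-03"),
    ("98234", 1, "2018-04-08"),
    ("24903", 2, "2018-05-08"),
    ("32843", 2, "2018-04-06"),
    ("09283", 2, "2018-03-07"),
]

def add_prev_subs(rows):
    """Hypothetical helper mimicking row_number().over(w) - 1."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[1]].append(row)          # like partitionBy(user_id)
    result = {}
    for user_rows in by_user.values():
        # like orderBy(created_at); ISO-formatted dates sort correctly as strings
        user_rows.sort(key=lambda r: r[2])
        for position, row in enumerate(user_rows):
            result[row[0]] = position        # zero-based position == row_number - 1
    return result

prev_subs = add_prev_subs(rows)
for sub_id, user_id, created_at in rows:
    print(sub_id, user_id, created_at, prev_subs[sub_id])
```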

If there can be ties (i.e. more than one row with a given date for a user), use dense_rank instead, so that rows sharing a date get the same value.

df = df.withColumn('prev_subs', func.dense_rank().over(w) - 1)
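
The two functions only differ when a user has several rows on the same date. The plain-Python sketch below (no Spark required) mimics both on made-up dates for a single user; the data and the function names are hypothetical and serve only to illustrate the semantics.

```python
# Made-up dates for a single user, already sorted; two rows share 2018-04-08.
dates = ["2018-04-08", "2018-04-08", "2018-06-03"]

def row_number_minus_one(sorted_dates):
    """Zero-based position: tied rows still get different values."""
    return list(range(len(sorted_dates)))

def dense_rank_minus_one(sorted_dates):
    """Number of distinct earlier dates: tied rows share a value."""
    distinct = sorted(set(sorted_dates))
    index_of = {d: i for i, d in enumerate(distinct)}
    return [index_of[d] for d in sorted_dates]

print(row_number_minus_one(dates))  # [0, 1, 2]
print(dense_rank_minus_one(dates))  # [0, 0, 1]
```

Note that the dense-rank variant counts distinct earlier dates, so the last row gets 1 even though two rows precede it.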
