How to capture events up to the first buy event if there is one, else all events until now, in PySpark?
Problem description
I have click data from an app, recorded per user, which is continuously updated in real time:
User_id Events Timestamp
u1 login 9:01
u1 start 9:05
u1 buy 9:10
u2 login 11:33
u2 cart 11:40
u3 login 15:03
u3 buy 15:10
u1 login 17:25
u1 buy 17:35
u4 login 18:33
u4 news 18:35
u3 news 19:09
u5 notifications 20:10
The expected output contains each user's events up to and including their first buy event or, if the user has no buy event, all of their events up to now, i.e. current_timestamp():
User_id Events Timestamp
u1 login 9:01
u1 start 9:05
u1 buy 9:10
u2 login 11:33
u2 cart 11:40
u3 login 15:03
u3 buy 15:10
u4 login 18:33
u4 news 18:35
u5 notifications 20:10
I want to capture the real-time state of each user only up to their first buy, and I don't want to include any later events, because my machine learning use case is about the first buy. I'm not sure whether this is the right way to handle this type of data.
Recommended answer
You can use a case when (F.when) with min over a window partitioned by user to get the first buy timestamp, using the current timestamp as the fallback in case the user has no buy events. Then you can filter for the rows whose time is no later than that timestamp. I used the IST time zone to match where you are (according to your profile).
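from pyspark.sql import functions as F, Window

# Session time zone (IST, per the answer) used when formatting current_timestamp()
spark.sql('set spark.sql.session.timeZone = IST')

result = df.withColumn(
    'first_buy',
    # Format the cutoff as H:mm so it matches the Timestamp column
    F.date_format(
        F.coalesce(
            # Earliest 'buy' timestamp per user (null if the user never bought)
            F.min(
                F.when(
                    F.col('Events') == 'buy', F.col('Timestamp').cast('timestamp')
                )
            ).over(Window.partitionBy('User_id')),
            # Fallback: keep everything up to now
            F.current_timestamp()
        ),
        'H:mm'
    )
).filter(
    # Keep rows at or before the cutoff
    'timestamp(Timestamp) <= timestamp(first_buy)'
).drop('first_buy').orderBy('User_id', F.col('Timestamp').cast('timestamp'))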
result.show()
+-------+-------------+---------+
|User_id| Events|Timestamp|
+-------+-------------+---------+
| u1| login| 9:01|
| u1| start| 9:05|
| u1| buy| 9:10|
| u2| login| 11:33|
| u2| cart| 11:40|
| u3| login| 15:03|
| u3| buy| 15:10|
| u4| login| 18:33|
| u4| news| 18:35|
| u5|notifications| 20:10|
+-------+-------------+---------+
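To sanity-check the rule on a small sample without a Spark session, the same logic can be sketched in plain Python. This is only an illustration, not part of the answer above: the helper names `parse_hhmm` and `events_up_to_first_buy` are hypothetical, and the `now` argument is an assumed stand-in for current_timestamp(), fixed at 23:59 so the example is reproducible.

```python
from datetime import datetime


def parse_hhmm(value):
    """Parse an 'H:MM' string such as '9:01' into a datetime.time."""
    return datetime.strptime(value, "%H:%M").time()


def events_up_to_first_buy(rows, now):
    """Keep each user's rows up to and including their first 'buy'.

    Users with no 'buy' keep every row up to `now` (an 'H:MM' string,
    assumed here in place of Spark's current_timestamp()).
    """
    # Earliest 'buy' time per user; users without a buy are simply absent.
    first_buy = {}
    for user, event, ts in rows:
        if event == "buy":
            t = parse_hhmm(ts)
            if user not in first_buy or t < first_buy[user]:
                first_buy[user] = t

    fallback = parse_hhmm(now)
    kept = [
        (user, event, ts)
        for user, event, ts in rows
        if parse_hhmm(ts) <= first_buy.get(user, fallback)
    ]
    # Same ordering as the Spark result: by user, then by time.
    return sorted(kept, key=lambda r: (r[0], parse_hhmm(r[2])))


rows = [
    ("u1", "login", "9:01"), ("u1", "start", "9:05"), ("u1", "buy", "9:10"),
    ("u2", "login", "11:33"), ("u2", "cart", "11:40"),
    ("u3", "login", "15:03"), ("u3", "buy", "15:10"),
    ("u1", "login", "17:25"), ("u1", "buy", "17:35"),
    ("u4", "login", "18:33"), ("u4", "news", "18:35"),
    ("u3", "news", "19:09"), ("u5", "notifications", "20:10"),
]

result = events_up_to_first_buy(rows, now="23:59")
for row in result:
    print(*row)
```

With the sample data from the question, this keeps the same ten rows as the Spark output and drops the three events that occur after a user's first buy.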