How to capture events up to the first buy event if there is one, else all events until now, in PySpark?
Question
I have click data from an app for a set of users, and it is continuously updated in real time:
User_id Events Timestamp
u1 login 9:01
u1 start 9:05
u1 buy 9:10
u2 login 11:33
u2 cart 11:40
u3 login 15:03
u3 buy 15:10
u1 login 17:25
u1 buy 17:35
u4 login 18:33
u4 news 18:35
u3 news 19:09
u5 notifications 20:10
The expected output keeps each user's events up to and including their first buy event or, if the user has no buy event, all of their events up to now, i.e. current_timestamp():
User_id Events Timestamp
u1 login 9:01
u1 start 9:05
u1 buy 9:10
u2 login 11:33
u2 cart 11:40
u3 login 15:03
u3 buy 15:10
u4 login 18:33
u4 news 18:35
u5 notifications 20:10
I want to capture the real-time state of each user only up to their first buy, and I don't want to include any later events, because my machine learning use case is predicting the first buy. I don't know the right way to do this for this type of data.
Recommended answer
You can use a case when expression with min over a window partitioned by user to get the first buy timestamp, falling back to the current timestamp when the user has no buy events. Then keep only the rows whose time is no later than that timestamp. I used the IST time zone to match where you are (according to your profile).
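from pyspark.sql import functions as F, Window

# 'IST' is a short zone ID; 'Asia/Kolkata' is the unambiguous region ID.
spark.sql('set spark.sql.session.timeZone = IST')

result = df.withColumn(
    'first_buy',
    F.date_format(
        F.coalesce(
            # Earliest buy time per user; null if the user never bought.
            F.min(
                F.when(
                    F.col('Events') == 'buy', F.col('Timestamp').cast('timestamp')
                )
            ).over(Window.partitionBy('User_id')),
            # Fallback when there is no buy event: the current time.
            F.current_timestamp()
        ),
        'H:mm'
    )
).filter(
    # Compare as timestamps, not strings, and keep the buy row itself.
    'timestamp(Timestamp) <= timestamp(first_buy)'
).drop('first_buy').orderBy('User_id', F.col('Timestamp').cast('timestamp'))

result.show()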
+-------+-------------+---------+
|User_id| Events|Timestamp|
+-------+-------------+---------+
| u1| login| 9:01|
| u1| start| 9:05|
| u1| buy| 9:10|
| u2| login| 11:33|
| u2| cart| 11:40|
| u3| login| 15:03|
| u3| buy| 15:10|
| u4| login| 18:33|
| u4| news| 18:35|
| u5|notifications| 20:10|
+-------+-------------+---------+
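To make the cutoff logic easier to check without a Spark session, here is a minimal plain-Python sketch of the same idea. It is not the Spark code above: the helper names parse_hm and events_up_to_first_buy are hypothetical, the rows are the sample data from the question, and the "now" value is passed in explicitly (assumed here to be 23:59) so the result is reproducible.

```python
from datetime import datetime, time


def parse_hm(text):
    """Parse an 'H:MM' string such as '9:01' into a datetime.time."""
    return datetime.strptime(text, "%H:%M").time()


def events_up_to_first_buy(rows, now):
    """Keep each user's events up to and including their first 'buy'.

    Users with no 'buy' event keep every event up to `now`.
    `rows` is an iterable of (user_id, event, 'H:MM') tuples.
    """
    # Step 1: earliest buy time per user (the min over a per-user partition).
    first_buy = {}
    for user, event, ts in rows:
        if event == "buy":
            t = parse_hm(ts)
            if user not in first_buy or t < first_buy[user]:
                first_buy[user] = t

    # Step 2: fall back to `now` when a user has no buy, then filter.
    kept = [
        (user, event, ts)
        for user, event, ts in rows
        if parse_hm(ts) <= first_buy.get(user, now)
    ]

    # Step 3: order by user, then by time of day.
    return sorted(kept, key=lambda r: (r[0], parse_hm(r[2])))


rows = [
    ("u1", "login", "9:01"), ("u1", "start", "9:05"), ("u1", "buy", "9:10"),
    ("u2", "login", "11:33"), ("u2", "cart", "11:40"),
    ("u3", "login", "15:03"), ("u3", "buy", "15:10"),
    ("u1", "login", "17:25"), ("u1", "buy", "17:35"),
    ("u4", "login", "18:33"), ("u4", "news", "18:35"),
    ("u3", "news", "19:09"), ("u5", "notifications", "20:10"),
]

# `now` is an assumption for this example; the Spark version uses
# current_timestamp() instead.
result = events_up_to_first_buy(rows, now=time(23, 59))
for row in result:
    print(*row)
```

The times are parsed before comparing because the raw strings do not sort chronologically: as text, '9:10' sorts after '11:33'. This is also why the Spark filter casts both sides to timestamp.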