How to capture events up to the first buy event if present, else all events till now, in PySpark?


Question

I have click data for users of an app, continuously updated in real time:

User id Events  Timestamp
u1     login    9:01
u1     start    9:05
u1     buy      9:10
u2     login    11:33
u2     cart     11:40
u3     login    15:03
u3     buy      15:10
u1     login    17:25
u1     buy      17:35
u4     login    18:33
u4     news     18:35
u3     news     19:09
u5     notifications    20:10

The expected output should contain each user's data up to the first buy event, or, if no buy is present, all events up to now, i.e. current_timestamp():

User id Events  Timestamp
u1     login    9:01
u1     start    9:05
u1     buy      9:10
u2     login    11:33
u2     cart     11:40
u3     login    15:03
u3     buy      15:10
u4     login    18:33
u4     news     18:35
u5     notifications    20:10

I want to capture the real-time state of each user up to the first buy only, and don't want to include later events, since my machine learning use case is predicting the first buy. I'm not sure this is the right way to handle this type of data.
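The rule can be stated precisely: for each user, keep events up to and including the first buy; if the user never buys, keep everything seen so far. A minimal pure-Python sketch of that rule (a list of tuples standing in for the real stream; names are illustrative, not from the original):

```python
def upto_first_buy(events):
    """Keep events up to and including the first 'buy'; keep all events
    if there is no buy. `events` is a list of (event_name, timestamp_str)
    tuples for ONE user, assumed already sorted by time."""
    kept = []
    for name, ts in events:
        kept.append((name, ts))
        if name == 'buy':
            break  # drop everything after the first buy
    return kept

u1 = [('login', '9:01'), ('start', '9:05'), ('buy', '9:10'),
      ('login', '17:25'), ('buy', '17:35')]
u2 = [('login', '11:33'), ('cart', '11:40')]

print(upto_first_buy(u1))  # events after the 9:10 buy are dropped
print(upto_first_buy(u2))  # no buy, so everything is kept
```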

Answer

You can use a case when with min over a partitioned window to get the first buy timestamp, using the current timestamp as the fallback in case there were no buy events. Then you can filter to the rows whose time is at or before that timestamp. I used the IST time zone to match where you are (according to your profile).

from pyspark.sql import functions as F, Window

spark.sql('set spark.sql.session.timeZone = IST')

result = df.withColumn(
    'first_buy',
    F.date_format(
        F.coalesce(
            # earliest buy timestamp per user (null if the user never bought)
            F.min(
                F.when(
                    F.col('Events') == 'buy', F.col('Timestamp').cast('timestamp')
                )
            ).over(Window.partitionBy('User_id')),
            # fall back to the current time when there is no buy event
            F.current_timestamp()
        ),
        'H:mm'
    )
).filter(
    'timestamp(Timestamp) <= timestamp(first_buy)'
).drop('first_buy').orderBy('User_id', F.col('Timestamp').cast('timestamp'))

result.show()
+-------+-------------+---------+
|User_id|       Events|Timestamp|
+-------+-------------+---------+
|     u1|        login|     9:01|
|     u1|        start|     9:05|
|     u1|          buy|     9:10|
|     u2|        login|    11:33|
|     u2|         cart|    11:40|
|     u3|        login|    15:03|
|     u3|          buy|    15:10|
|     u4|        login|    18:33|
|     u4|         news|    18:35|
|     u5|notifications|    20:10|
+-------+-------------+---------+
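The same per-user cutoff can be sanity-checked without a Spark cluster. The sketch below is the plain-Python analogue of the answer's pipeline: the first-buy minute per user plays the role of `min(when(...))` over the window, the `now` fallback plays the role of `coalesce` with `current_timestamp()`, and the final comprehension is the filter. All names and the minute arithmetic are illustrative assumptions, not part of the original answer:

```python
from collections import defaultdict

def to_minutes(ts):
    # 'H:mm' string -> minutes since midnight, for easy comparison
    h, m = ts.split(':')
    return int(h) * 60 + int(m)

def filter_upto_first_buy(rows, now='23:59'):
    """rows: list of (user, event, 'H:mm') tuples. Returns the rows at or
    before each user's first buy, falling back to `now` when no buy."""
    first_buy = defaultdict(lambda: to_minutes(now))
    for user, event, ts in rows:
        if event == 'buy':
            first_buy[user] = min(first_buy[user], to_minutes(ts))
    return [(u, e, t) for u, e, t in rows if to_minutes(t) <= first_buy[u]]

rows = [
    ('u1', 'login', '9:01'), ('u1', 'start', '9:05'), ('u1', 'buy', '9:10'),
    ('u2', 'login', '11:33'), ('u2', 'cart', '11:40'),
    ('u1', 'login', '17:25'), ('u1', 'buy', '17:35'),
]
print(filter_upto_first_buy(rows))  # u1's 17:25/17:35 rows are dropped
```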

