How to capture events up to the first buy event if there is one, else all events so far, in PySpark?


Question

I have click data from an app, per user, which is continuously updated in real time:

User_id Events  Timestamp
u1     login    9:01
u1     start    9:05
u1     buy      9:10
u2     login    11:33
u2     cart     11:40
u3     login    15:03
u3     buy      15:10
u1     login    17:25
u1     buy      17:35
u4     login    18:33
u4     news     18:35
u3     news     19:09
u5     notifications    20:10

The expected output has each user's data up to their first buy event or, if the user has no buy event, all of their events up to now, i.e. current_timestamp().

User_id Events  Timestamp
u1     login    9:01
u1     start    9:05
u1     buy      9:10
u2     login    11:33
u2     cart     11:40
u3     login    15:03
u3     buy      15:10
u4     login    18:33
u4     news     18:35
u5     notifications    20:10

I want to capture the real-time state of users only up to their first buy, and I don't want to include the events after it, because my machine learning use case is about the first buy. I don't know what the right way to do this is for this type of data.

Recommended answer

You can use a case when with min over a window partitioned by user to get the first buy timestamp, using the current timestamp as the fallback when a user has no buy events. Then you can keep only the rows whose time is no later than that timestamp. I used the IST time zone to match where you are (according to your profile).

from pyspark.sql import functions as F, Window

# Use IST as the session time zone for timestamp casts and formatting
spark.sql('set spark.sql.session.timeZone = IST')

result = df.withColumn(
    'first_buy',
    F.date_format(
        F.coalesce(
            # Earliest 'buy' time per user; null if the user never bought
            F.min(
                F.when(
                    F.col('Events') == 'buy', F.col('Timestamp').cast('timestamp')
                )
            ).over(Window.partitionBy('User_id')),
            # Fallback cutoff for users with no buy event
            F.current_timestamp()
        ),
        'H:mm'
    )
).filter(
    # Keep rows up to and including the cutoff
    'timestamp(Timestamp) <= timestamp(first_buy)'
).drop('first_buy').orderBy('User_id', F.col('Timestamp').cast('timestamp'))

result.show()
+-------+-------------+---------+
|User_id|       Events|Timestamp|
+-------+-------------+---------+
|     u1|        login|     9:01|
|     u1|        start|     9:05|
|     u1|          buy|     9:10|
|     u2|        login|    11:33|
|     u2|         cart|    11:40|
|     u3|        login|    15:03|
|     u3|          buy|    15:10|
|     u4|        login|    18:33|
|     u4|         news|    18:35|
|     u5|notifications|    20:10|
+-------+-------------+---------+
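
To make the logic of the window expression easier to follow, here is a minimal plain-Python sketch of the same idea, not the PySpark code itself. The helper names parse_hmm and events_up_to_first_buy are hypothetical, and the fixed now value of 23:59 is an assumption standing in for current_timestamp().

```python
from datetime import time


def parse_hmm(s):
    """Parse an 'H:mm' string such as '9:01' into a datetime.time."""
    hours, minutes = s.split(":")
    return time(int(hours), int(minutes))


def events_up_to_first_buy(rows, now):
    """Keep each user's rows up to and including their first 'buy'.

    rows: list of (user_id, event, 'H:mm') tuples.
    now:  datetime.time used as the cutoff for users with no 'buy' event.
    """
    # Pass 1: earliest buy time per user (the min over a case-when per partition).
    first_buy = {}
    for user, event, ts in rows:
        if event == "buy":
            t = parse_hmm(ts)
            if user not in first_buy or t < first_buy[user]:
                first_buy[user] = t

    # Pass 2: filter against the cutoff, falling back to `now` (the coalesce step).
    kept = [
        (user, event, ts)
        for user, event, ts in rows
        if parse_hmm(ts) <= first_buy.get(user, now)
    ]
    return sorted(kept, key=lambda r: (r[0], parse_hmm(r[2])))


rows = [
    ("u1", "login", "9:01"), ("u1", "start", "9:05"), ("u1", "buy", "9:10"),
    ("u2", "login", "11:33"), ("u2", "cart", "11:40"),
    ("u3", "login", "15:03"), ("u3", "buy", "15:10"),
    ("u1", "login", "17:25"), ("u1", "buy", "17:35"),
    ("u4", "login", "18:33"), ("u4", "news", "18:35"),
    ("u3", "news", "19:09"), ("u5", "notifications", "20:10"),
]

# Assumed cutoff: a fixed end-of-day time instead of the real current time.
result = events_up_to_first_buy(rows, now=time(23, 59))
for row in result:
    print(*row)
```

As in the answer, the comparison is on time of day only, so the fallback cutoff only behaves as "up to now" when all events fall on the same day.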
