Spark Window function last not null value


Question

We have a time series database for user events, which looks like the following:

timestamp             user_id     event            ticke_type     error_type 
2019-06-06 14:33:31   user_a      choose_ticket    ticke_b        NULL
2019-06-06 14:34:31   user_b      choose_ticket    ticke_f        NULL
2019-06-06 14:36:31   user_a      booking_error    NULL           error_c
2019-06-06 14:37:31   user_a      choose_ticket    ticke_h        NULL
2019-06-06 14:38:31   user_a      booking_error    NULL           error_d
2019-06-06 14:39:31   user_a      booking_error    NULL           error_e

This is one use case that we need:

In order to investigate what kind of ticket type is causing a booking error, we have to look into the ticket type, which is available only on the earlier choose_ticket event.

In this case, what we are looking for is: for each booking_error event, find the previous choose_ticket event for the same user and merge that ticket type into the booking_error event.

So ideally, the output that we want is:

timestamp             user_id     event            ticke_type     error_type 
2019-06-06 14:36:31   user_a      booking_error    ticke_b        error_c
2019-06-06 14:38:31   user_a      booking_error    ticke_h        error_d
2019-06-06 14:39:31   user_a      booking_error    ticke_h        error_e

The closest that I can find is Spark add new column to dataframe with value from previous row, which allows us to take the properties from the previous event and apply them to the event right after it.

This almost works, except that when there are multiple consecutive events (booking_error in this example), only the very first one gets the needed properties. E.g., this is what we get with the solution from the SO link above:

timestamp             user_id     event            ticke_type     error_type 
2019-06-06 14:36:31   user_a      booking_error    ticke_b        error_c
2019-06-06 14:38:31   user_a      booking_error    ticke_h        error_d
2019-06-06 14:39:31   user_a      booking_error    NULL           error_e
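The lag-based approach from the linked answer can be sketched roughly as follows (variable names assumed; `df` is the events DataFrame from above). `lag` looks back exactly one row, so a booking_error that directly follows another booking_error copies that row's NULL ticke_type rather than the last real ticket:

```scala
// Rough sketch of the lag-based approach: only looks one row back.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

val w = Window.partitionBy("user_id").orderBy("timestamp")

df.withColumn("closest", lag("ticke_type", 1).over(w))
  .filter(col("event") === "booking_error")
  .show()
```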

To sum up: for a given row, how do we find the previous row matching certain criteria and "cherry-pick" its property over?

What is the best approach?

Answer

org.apache.spark.sql.functions.last is what you are looking for. You may rename the "closest" column to replace ticke_type in the end.

scala> df.show
+-------------------+-------+-------------+----------+----------+
|          timestamp|user_id|        event|ticke_type|error_type|
+-------------------+-------+-------------+----------+----------+
|2019-06-06 14:33:31| user_a|choose_ticket|   ticke_b|      null|
|2019-06-06 14:34:31| user_b|choose_ticket|   ticke_f|      null|
|2019-06-06 14:36:31| user_a|booking_error|      null|   error_c|
|2019-06-06 14:37:31| user_a|choose_ticket|   ticke_h|      null|
|2019-06-06 14:38:31| user_a|booking_error|      null|   error_d|
|2019-06-06 14:39:31| user_a|booking_error|      null|   error_e|
+-------------------+-------+-------------+----------+----------+

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val overColumns = Window.partitionBy("user_id").orderBy("timestamp")
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@70dc8c9a

scala> df.withColumn("closest", 
  org.apache.spark.sql.functions.last("ticke_type", true).over(overColumns)).filter($"event" === "booking_error").show
+-------------------+-------+-------------+----------+----------+-------+
|          timestamp|user_id|        event|ticke_type|error_type|closest|
+-------------------+-------+-------------+----------+----------+-------+
|2019-06-06 14:36:31| user_a|booking_error|      null|   error_c|ticke_b|
|2019-06-06 14:38:31| user_a|booking_error|      null|   error_d|ticke_h|
|2019-06-06 14:39:31| user_a|booking_error|      null|   error_e|ticke_h|
+-------------------+-------+-------------+----------+----------+-------+
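To get exactly the desired output, you can then drop the all-null original column and rename the window result in its place. A minimal sketch, assuming `df` from the session above:

```scala
// Fill ticke_type with the last non-null value per user up to the current
// row, keep only the booking_error rows, and rename the filled column.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

val overColumns = Window.partitionBy("user_id").orderBy("timestamp")

val result = df
  .withColumn("closest", last("ticke_type", ignoreNulls = true).over(overColumns))
  .filter(col("event") === "booking_error")
  .drop("ticke_type")                          // the original column is all NULL here
  .withColumnRenamed("closest", "ticke_type")  // filled values take its place
```

This works because an ordered window's default frame runs from the start of the partition up to the current row, which is exactly the range last with ignoreNulls should scan here; you can spell it out with `.rowsBetween(Window.unboundedPreceding, Window.currentRow)` if you prefer to be explicit.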
