Spark Window function: last non-null value
Question
We have a time-series database for user events, which looks like the following:
timestamp            user_id  event          ticke_type  error_type
2019-06-06 14:33:31  user_a   choose_ticket  ticke_b     NULL
2019-06-06 14:34:31  user_b   choose_ticket  ticke_f     NULL
2019-06-06 14:36:31  user_a   booking_error  NULL        error_c
2019-06-06 14:37:31  user_a   choose_ticket  ticke_h     NULL
2019-06-06 14:38:31  user_a   booking_error  NULL        error_d
2019-06-06 14:39:31  user_a   booking_error  NULL        error_e
This is one use case that we need to support:
In order to investigate what kind of ticket type is causing a booking error, we have to look at the ticket type, which is available only on the earlier choose_ticket event.
In this case, what we are looking for is: for each booking_error event, find the previous choose_ticket event for the same user and merge its ticket type into the booking_error event.
So ideally, the output that we want is:
timestamp            user_id  event          ticke_type  error_type
2019-06-06 14:36:31  user_a   booking_error  ticke_b     error_c
2019-06-06 14:38:31  user_a   booking_error  ticke_h     error_d
2019-06-06 14:39:31  user_a   booking_error  ticke_h     error_e
The closest that I can find is Spark add new column to dataframe with value from previous row, which lets us take a property from the previous event and apply it to the event right after.
This almost works, except that when several booking_error events follow the same choose_ticket event, only the very first one gets the needed property. For example, this is what we get with the solution from the SO link above (a sketch of that lag-based approach follows the table):
timestamp            user_id  event          ticke_type  error_type
2019-06-06 14:36:31  user_a   booking_error  ticke_b     error_c
2019-06-06 14:38:31  user_a   booking_error  ticke_h     error_d
2019-06-06 14:39:31  user_a   booking_error  NULL        error_e
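For reference, here is a minimal sketch of that lag-based approach (my reconstruction, assuming the schema above and spark.implicits._ in scope, as in spark-shell). lag looks back exactly one row, which is why the last booking_error row misses the value:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val byUser = Window.partitionBy("user_id").orderBy("timestamp")

// lag(..., 1) copies ticke_type from the immediately preceding row only,
// so a booking_error that follows another booking_error sees null.
df.withColumn("closest", lag("ticke_type", 1).over(byUser))
  .filter($"event" === "booking_error")
  .show()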
To sum up: for a given row, how can we find the previous row that matches certain criteria and "cherry-pick" its property over? What is the best way to do this?
Answer
org.apache.spark.sql.functions.last is what you are looking for. You can rename the "closest" column to replace ticke_type at the end.
scala> df.show
+-------------------+-------+-------------+----------+----------+
| timestamp|user_id| event|ticke_type|error_type|
+-------------------+-------+-------------+----------+----------+
|2019-06-06 14:33:31| user_a|choose_ticket| ticke_b| null|
|2019-06-06 14:34:31| user_b|choose_ticket| ticke_f| null|
|2019-06-06 14:36:31| user_a|booking_error| null| error_c|
|2019-06-06 14:37:31| user_a|choose_ticket| ticke_h| null|
|2019-06-06 14:38:31| user_a|booking_error| null| error_d|
|2019-06-06 14:39:31| user_a|booking_error| null| error_e|
+-------------------+-------+-------------+----------+----------+
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val overColumns = Window.partitionBy("user_id").orderBy("timestamp")
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@70dc8c9a
scala> df.withColumn("closest",
     |   org.apache.spark.sql.functions.last("ticke_type", true).over(overColumns))
     |   .filter($"event" === "booking_error").show
+-------------------+-------+-------------+----------+----------+-------+
| timestamp|user_id| event|ticke_type|error_type|closest|
+-------------------+-------+-------------+----------+----------+-------+
|2019-06-06 14:36:31| user_a|booking_error| null| error_c|ticke_b|
|2019-06-06 14:38:31| user_a|booking_error| null| error_d|ticke_h|
|2019-06-06 14:39:31| user_a|booking_error| null| error_e|ticke_h|
+-------------------+-------+-------------+----------+----------+-------+
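To produce exactly the desired output, one possible final step (a sketch, assuming the DataFrame above and spark.implicits._ in scope) is to coalesce the filled-in value into ticke_type and drop the helper column. Making the default frame of the ordered window explicit with rowsBetween behaves the same way:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, last}

// Explicit version of the default frame for an ordered window:
// all rows from the start of the partition up to the current row.
val upToCurrent = Window.partitionBy("user_id").orderBy("timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val result = df
  .withColumn("closest", last("ticke_type", ignoreNulls = true).over(upToCurrent))
  .filter($"event" === "booking_error")
  .withColumn("ticke_type", coalesce($"ticke_type", $"closest")) // fill in the ticket type
  .drop("closest")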