具有 Preceding Till 特定值的分析函数 [英] Analytical function with Preceding Till a Particular Value

查看：53 发布时间：2021/6/25 18:32:15 apache-spark pyspark

本文介绍了具有 Preceding Till 特定值的分析函数的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有如下数据

arrayData = [
  ('abc','PN1','SN1','2021-02-03 10:20:11','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:20:15','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:20:19','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:21:11','2021-02-03 10:21:19','Success'),
  ('abc','PN1','SN1','2021-02-03 10:22:19','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:22:29','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:22:39','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:22:49','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:22:59','','Fail'),
  ('abc','PN1','SN1','2021-02-03 10:31:11','2021-02-03 10:31:19','Success'),
  ('abc','PN1','SN1','2021-02-03 10:31:21','2021-02-03 10:32:19','Success'),
  ('abc','PN1','SN1','2021-02-03 11:32:49','','Fail'),
  ('abc','PN1','SN1','2021-02-03 11:34:59','','Fail'),
  ('abc','PN1','SN2','2021-02-03 10:22:49','','Fail'),
  ('abc','PN1','SN2','2021-02-03 10:22:59','','Fail')
]
root
 |-- event: string (nullable = true)
 |-- PN: string (nullable = true)
 |-- SN: string (nullable = true)
 |-- Claim_Start: string (nullable = true)
 |-- Claim_End: string (nullable = true)
 |-- Status: string (nullable = true)

+-----+---+---+-------------------+-------------------+-------+
|event| PN| SN|        Claim_Start|          Claim_End| Status|
+-----+---+---+-------------------+-------------------+-------+
|  abc|PN1|SN1|2021-02-03 10:20:11|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:20:15|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:20:19|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:21:11|2021-02-03 10:21:19|Success|
|  abc|PN1|SN1|2021-02-03 10:22:19|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:22:29|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:22:39|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:22:49|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:22:59|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 10:31:11|2021-02-03 10:31:19|Success|
|  abc|PN1|SN1|2021-02-03 10:31:21|2021-02-03 10:32:19|Success|
|  abc|PN1|SN1|2021-02-03 11:32:49|                   |   Fail|
|  abc|PN1|SN1|2021-02-03 11:34:59|                   |   Fail|
|  abc|PN1|SN2|2021-02-03 10:22:49|                   |   Fail|
|  abc|PN1|SN2|2021-02-03 10:22:59|                   |   Fail|
+-----+---+---+-------------------+-------------------+-------+

我做了一些如下的转换

df2 = df.withColumn("event_start_time",f.to_timestamp(df.Claim_Start,'yyyy-MM-dd HH:mm:ss')).withColumn("event_end_time",f.to_timestamp(df.Claim_End,'yyyy-MM-dd HH:mm:ss'))
df2 = df2.drop("Claim_Start").drop("Claim_End")
+-----+---+---+-------+-------------------+-------------------+
|event| PN| SN| Status|   event_start_time|     event_end_time|
+-----+---+---+-------+-------------------+-------------------+
|  abc|PN1|SN1|   Fail|2021-02-03 10:20:11|               null|
|  abc|PN1|SN1|   Fail|2021-02-03 10:20:15|               null|
|  abc|PN1|SN1|   Fail|2021-02-03 10:20:19|               null|
|  abc|PN1|SN1|Success|2021-02-03 10:21:11|2021-02-03 10:21:19|
|  abc|PN1|SN1|   Fail|2021-02-03 10:22:19|               null|
|  abc|PN1|SN1|   Fail|2021-02-03 10:22:29|               null|
|  abc|PN1|SN1|   Fail|2021-02-03 10:22:39|               null|
|  abc|PN1|SN1|   Fail|2021-02-03 10:22:49|               null|
|  abc|PN1|SN1|   Fail|2021-02-03 10:22:59|               null|
|  abc|PN1|SN1|Success|2021-02-03 10:31:11|2021-02-03 10:31:19|
|  abc|PN1|SN1|Success|2021-02-03 10:31:21|2021-02-03 10:32:19|
|  abc|PN1|SN1|   Fail|2021-02-03 11:32:49|               null|
|  abc|PN1|SN1|   Fail|2021-02-03 11:34:59|               null|
|  abc|PN1|SN2|   Fail|2021-02-03 10:22:49|               null|
|  abc|PN1|SN2|   Fail|2021-02-03 10:22:59|               null|
+-----+---+---+-------+-------------------+-------------------+

我需要的输出

+---+---+-----+-------+-------------------+-------------------+-------------------+------------+
| PN| SN|event| status|   event_start_time|     event_end_time|          first_try|num_attempts|
+---+---+-----+-------+-------------------+-------------------+-------------------+------------+
|PN1|SN1|  abc|   Fail|2021-02-03 11:32:49|                   |2021-02-03 11:32:49|           2|
|PN1|SN1|  abc|Success|2021-02-03 10:21:11|2021-02-03 10:21:19|2021-02-03 10:20:11|           4|
|PN1|SN1|  abc|Success|2021-02-03 10:31:11|2021-02-03 10:31:19|2021-02-03 10:22:29|           6|
|PN1|SN1|  abc|Success|2021-02-03 10:31:21|2021-02-03 10:32:19|               null|           1|
|PN1|SN2|  abc|   Fail|2021-02-03 10:22:49|                   |2021-02-03 10:22:49|           2|
+---+---+-----+-------+-------------------+-------------------+-------------------+------------+

输出逻辑，

成功事件发生在 SN1 记录的第四条记录中，成功在第四次尝试中获得，第一次尝试发生在2021-02-03 10:20:11".如果最后一次失败，我们将结束日期保留为 null 并计数.

The Success event happens in the fourth record for SN1 record, and the success was got in was in 4th try, and the first try happened at "2021-02-03 10:20:11". In case of last failures we keep the end date as null and count.

有什么方法可以使用分析函数并返回到失败，即不包括成功.

Is there any way i can use the analytical functions and traverse back till the failures i.e. not including the success.

感谢任何帮助.

推荐答案

基本上我们需要一个窗口函数，然后将 Status 列滞后到从组中.按这些分组并应用您指出的逻辑会产生以下代码:

Basically we need a window function, then lag over the Status column to from the groups. Grouping by these and applying the logic you pointed out results in the following code:

from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql import types as T

w = Window.partitionBy('PN', 'SN').orderBy('Claim_Start')
df.withColumn('lagged', F.lag('Status').over(w))\
   .withColumn('status_flag', F.when(F.col('lagged') == 'Success', 1).otherwise(0))\
   .withColumn('group', F.sum('status_flag').over(w))\
   .groupBy('PN', 'SN', 'event','group').agg(F.last('Status').alias('status'),
                                             F.when(F.last('Status') == 'Success', F.last('Claim_Start')).otherwise(F.first('Claim_Start')).alias('event_start_time'),
                                             F.last('Claim_End').alias('event_end_time'),
                                             F.when(F.count('Claim_Start') > 1, F.first('Claim_Start')).otherwise(None).alias('first_try'),
                                             F.count('Claim_Start').alias('num_attempts')
                                             ).drop('group').orderBy('PN', 'SN', 'event_end_time').show()

结果

+---+---+-----+-------+-------------------+-------------------+-------------------+------------+
| PN| SN|event| status|   event_start_time|     event_end_time|          first_try|num_attempts|
+---+---+-----+-------+-------------------+-------------------+-------------------+------------+
|PN1|SN1|  abc|   Fail|2021-02-03 11:32:49|                   |2021-02-03 11:32:49|           2|
|PN1|SN1|  abc|Success|2021-02-03 10:21:11|2021-02-03 10:21:19|2021-02-03 10:20:11|           4|
|PN1|SN1|  abc|Success|2021-02-03 10:31:11|2021-02-03 10:31:19|2021-02-03 10:22:19|           6|
|PN1|SN1|  abc|Success|2021-02-03 10:31:21|2021-02-03 10:32:19|               null|           1|
|PN1|SN2|  abc|   Fail|2021-02-03 10:22:49|                   |2021-02-03 10:22:49|           2|
+---+---+-----+-------+-------------------+-------------------+-------------------+------------+

这篇关于具有 Preceding Till 特定值的分析函数的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

具有 Preceding Till 特定值的分析函数 [英] Analytical function with Preceding Till a Particular Value

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

具有 Preceding Till 特定值的分析函数 [英] Analytical function with Preceding Till a Particular Value

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭