How to calculate days between when last condition was met?


Problem description

Current df:

df = spark.createDataFrame([
    ("2020-01-12","d1",0),
    ("2020-01-12","d2",0),
    ("2020-01-13","d3",0),
    ("2020-01-14","d4",1), 
    ("2020-01-15","d5",0),
    ("2020-01-15","d6",0),
    ("2020-01-16","d7",0),
    ("2020-01-17","d8",0),
    ("2020-01-18","d9",1),
    ("2020-01-19","d10",0),
    ("2020-01-20","d11",0),], 
    ['date', 'device', 'condition'])

df.show()

+----------+------+---------+
|      date|device|condition|
+----------+------+---------+
|2020-01-12|    d1|        0|
|2020-01-12|    d2|        0|
|2020-01-13|    d3|        0|
|2020-01-14|    d4|        1|
|2020-01-15|    d5|        0|
|2020-01-15|    d6|        0|
|2020-01-16|    d7|        0|
|2020-01-17|    d8|        0|
|2020-01-18|    d9|        1|
|2020-01-19|   d10|        0|
|2020-01-20|   d11|        0|
+----------+------+---------+

Desired output df:

want_df = spark.createDataFrame([
    ("2020-01-12","d1",0,0),
    ("2020-01-12","d2",0,0),
    ("2020-01-13","d3",0,1),
    ("2020-01-14","d4",1,2), 
    ("2020-01-15","d5",0,1),
    ("2020-01-15","d6",0,1),
    ("2020-01-16","d7",0,2),
    ("2020-01-17","d8",0,3),
    ("2020-01-18","d9",1,4),
    ("2020-01-19","d10",0,1),
    ("2020-01-20","d11",0,2),], 
    ['date', 'device', 'condition', 'life'])

want_df.show()

+----------+------+---------+----+
|      date|device|condition|life|
+----------+------+---------+----+
|2020-01-12|    d1|        0|   0|
|2020-01-12|    d2|        0|   0|
|2020-01-13|    d3|        0|   1|
|2020-01-14|    d4|        1|   2|
|2020-01-15|    d5|        0|   1|
|2020-01-15|    d6|        0|   1|
|2020-01-16|    d7|        0|   2|
|2020-01-17|    d8|        0|   3|
|2020-01-18|    d9|        1|   4|
|2020-01-19|   d10|        0|   1|
|2020-01-20|   d11|        0|   2|
+----------+------+---------+----+

The objective is to calculate the date difference (number of days) up to when condition = 1; the count then resets and starts again from the date the condition was last met. life is the column I am trying to calculate. Any idea how to compute this? Window or lag?
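To pin down the reset semantics before reaching for Spark, here is a minimal pure-Python sketch (the function name `life_column` is illustrative, not part of the question): life is the day difference from an anchor date, and the anchor jumps forward to the current date immediately after a condition = 1 row.

```python
from datetime import date

rows = [
    ("2020-01-12", "d1", 0), ("2020-01-12", "d2", 0), ("2020-01-13", "d3", 0),
    ("2020-01-14", "d4", 1), ("2020-01-15", "d5", 0), ("2020-01-15", "d6", 0),
    ("2020-01-16", "d7", 0), ("2020-01-17", "d8", 0), ("2020-01-18", "d9", 1),
    ("2020-01-19", "d10", 0), ("2020-01-20", "d11", 0),
]

def life_column(rows):
    anchor = None          # date the current "life" count started from
    out = []
    for d, _, cond in rows:
        cur = date.fromisoformat(d)
        if anchor is None:
            anchor = cur   # first row: count starts at 0
        out.append((cur - anchor).days)
        if cond == 1:
            anchor = cur   # condition met: counting restarts from this date
    return out

print(life_column(rows))   # [0, 0, 1, 2, 1, 1, 2, 3, 4, 1, 2]
```

This matches the life column in the desired output above.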

Answer

This is the type of question that can be simplified by adding some temporary rows (we flag them and then remove them later).

from pyspark.sql import Window
from pyspark.sql.functions import lit, lag, sum as fsum, first, datediff

(1) First, create a new dataframe df1 that replicates all rows with condition == 1 but sets their condition = 0 and flag = 1, then union the result with the original dataframe (with flag = 0):

df1 = df.withColumn('flag', lit(0)).union(
    df.where('condition = 1').withColumn('condition', lit(0)).withColumn('flag', lit(1))
)
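As a sanity check on step (1), the same union trick in plain Python (a sketch using a small subset of the data, not part of the answer's Spark code):

```python
# Tag every original row with flag = 0, then append a flag = 1 copy of each
# condition == 1 row with its condition zeroed out.
rows = [
    ("2020-01-13", "d3", 0),
    ("2020-01-14", "d4", 1),
    ("2020-01-15", "d5", 0),
]
df1_rows = [(d, dev, c, 0) for d, dev, c in rows] \
         + [(d, dev, 0, 1) for d, dev, c in rows if c == 1]
print(df1_rows)
```

The condition = 1 row for d4 now appears twice: once as the original (flag = 0) and once as the zeroed copy (flag = 1) that will seed the next sub-group.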

(2) Then set up the following two Window specs. Use w1 to help create a sub-group label g that groups all consecutive rows until condition switches from 1 to 0. Add flag to orderBy() so that each newly added row sits right behind its corresponding condition = 1 row and gets grouped under the next group label.

w1 = Window.partitionBy(lit(0)).orderBy('date', 'flag')
w2 = Window.partitionBy(lit(0), 'g').orderBy('date', 'flag')

Note: if you have a huge dataframe, you might want to change lit(0) to an actual or calculated column to avoid Spark moving all rows onto a single partition. UPDATE: based on the comments, the dataframe is a single time series that can be loaded onto a single partition, so using lit(0) should be enough.

(3) Use the lag and sum functions over w1 to find the sub-group label g, then compute first_date within each group using Window spec w2. This date is used to calculate the life column:

df2 = df1.withColumn('g', fsum((lag('condition').over(w1) == 1).astype('int')).over(w1)) \
    .withColumn('first_date', first('date').over(w2)) \
    .withColumn('life', datediff('date','first_date'))
df2.show()
+----------+------+---------+----+---+----------+----+
|      date|device|condition|flag|  g|first_date|life|
+----------+------+---------+----+---+----------+----+
|2020-01-12|    d1|        0|   0|  0|2020-01-12|   0|
|2020-01-12|    d2|        0|   0|  0|2020-01-12|   0|
|2020-01-13|    d3|        0|   0|  0|2020-01-12|   1|
|2020-01-14|    d4|        1|   0|  0|2020-01-12|   2|
|2020-01-14|    d4|        0|   1|  1|2020-01-14|   0|
|2020-01-15|    d5|        0|   0|  1|2020-01-14|   1|
|2020-01-15|    d6|        0|   0|  1|2020-01-14|   1|
|2020-01-16|    d7|        0|   0|  1|2020-01-14|   2|
|2020-01-17|    d8|        0|   0|  1|2020-01-14|   3|
|2020-01-18|    d9|        1|   0|  1|2020-01-14|   4|
|2020-01-18|    d9|        0|   1|  2|2020-01-18|   0|
|2020-01-19|   d10|        0|   0|  2|2020-01-18|   1|
|2020-01-20|   d11|        0|   0|  2|2020-01-18|   2|
+----------+------+---------+----+---+----------+----+
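The group label g above can be mirrored in plain Python, which may help verify the lag/sum logic (`group_labels` is a hypothetical helper, not part of the answer): g increases by 1 whenever the previous row, in (date, flag) order, has condition == 1, exactly like fsum((lag('condition') == 1).astype('int')).

```python
def group_labels(conds):
    # Running count of how many condition == 1 rows precede each row,
    # i.e. a cumulative sum of (previous condition == 1).
    g, out, prev = 0, [], 0
    for c in conds:
        g += (prev == 1)
        out.append(g)
        prev = c
    return out

# condition values of df1 ordered by (date, flag); the flagged copies are 0
conds = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(group_labels(conds))  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2]
```

These labels match the g column in the df2 output above.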

(4) Drop the temporary rows and columns to get the final dataframe:

df_new = df2.filter('flag = 0').drop('first_date', 'g', 'flag')
df_new.show()
+----------+------+---------+----+
|      date|device|condition|life|
+----------+------+---------+----+
|2020-01-12|    d1|        0|   0|
|2020-01-12|    d2|        0|   0|
|2020-01-13|    d3|        0|   1|
|2020-01-14|    d4|        1|   2|
|2020-01-15|    d5|        0|   1|
|2020-01-15|    d6|        0|   1|
|2020-01-16|    d7|        0|   2|
|2020-01-17|    d8|        0|   3|
|2020-01-18|    d9|        1|   4|
|2020-01-19|   d10|        0|   1|
|2020-01-20|   d11|        0|   2|
+----------+------+---------+----+
