How to find the difference between the 1st row and the nth row of a dataframe based on a condition using Spark windowing
Question
Here is my exact requirement. I have to add a new column named DAYS_TO_NEXT_PD_ENCOUNTER. As the name indicates, each value in the new column should be the difference between the RANK of the row whose CLAIM_TYP is 'PD' and the RANK of the current row. For one ID, the 'PD' row can occur in between any of the 'RV's and 'RJ's. For the rows that appear after the first occurrence of CLAIM_TYP 'PD', the difference should be null, as shown below:
The API 'last' works if the CLM_TYP 'PD' occurs as the last element, but that will not always be the case. For one ID, it can occur in between any of the 'RV's and 'RJ's.
+----------+--------+---------+----+-------------------------+
| ID | WEEK_ID|CLAIM_TYP|RANK|DAYS_TO_NEXT_PD_ENCOUNTER|
+----------+--------+---------+----+-------------------------+
| 30641314|20180209| RV| 1| 5|
| 30641314|20180209| RJ| 2| 4|
| 30641314|20180216| RJ| 3| 3|
| 30641314|20180216| RJ| 4| 2|
| 30641314|20180216| RJ| 5| 1|
| 30641314|20180216| PD| 6| 0|
| 48115882|20180209| RV| 1| 3|
| 48115882|20180209| RV| 2| 2|
| 48115882|20180209| RV| 3| 1|
| 48115882|20180209| PD| 4| 0|
| 48115882|20180216| RJ| 5| null|
| 48115882|20180302| RJ| 6| null|
+----------+--------+---------+----+-------------------------+
Answer
Shown here is a PySpark solution.
You can use conditional aggregation with max(when...) to get the rank of the 'PD' row per ID, then take its difference from the current row's rank. After getting the difference, use a when... to null out the rows with negative differences, since they all occur after the first 'PD' row.
# necessary imports
from pyspark.sql import Window
from pyspark.sql.functions import row_number, when, max

# number rows within each ID by service date
w1 = Window.partitionBy(df.id).orderBy(df.svc_dt)
df = df.withColumn('rnum', row_number().over(w1))

# rank of the 'PD' row within each ID, minus the current row's rank
w2 = Window.partitionBy(df.id)
res = df.withColumn('diff_pd_rank',
                    max(when(df.clm_typ == 'PD', df.rnum)).over(w2) - df.rnum)

# keep only non-negative differences; rows after the 'PD' row become null
res = res.withColumn('days_to_next_pd_encounter',
                     when(res.diff_pd_rank >= 0, res.diff_pd_rank))
res.show()
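The same window logic can be sanity-checked without a Spark session. Below is a minimal plain-Python sketch over the sample rows from the table above (a hand-copied subset of the (ID, CLAIM_TYP, RANK) columns, not the author's actual data), mirroring the max(when...) aggregation and the null-out step:

```python
# Plain-Python sketch of the same logic: find the 'PD' rank per ID,
# then subtract each row's rank, nulling out negative differences.
rows = [
    (30641314, 'RV', 1), (30641314, 'RJ', 2), (30641314, 'RJ', 3),
    (30641314, 'RJ', 4), (30641314, 'RJ', 5), (30641314, 'PD', 6),
    (48115882, 'RV', 1), (48115882, 'RV', 2), (48115882, 'RV', 3),
    (48115882, 'PD', 4), (48115882, 'RJ', 5), (48115882, 'RJ', 6),
]

# max(when(clm_typ == 'PD', rnum)) over the ID partition
pd_rank = {}
for id_, typ, rnum in rows:
    if typ == 'PD':
        pd_rank[id_] = max(pd_rank.get(id_, rnum), rnum)

# when(diff >= 0, diff): rows past the 'PD' row get None (null)
result = [
    (id_, typ, rnum,
     pd_rank[id_] - rnum
     if id_ in pd_rank and pd_rank[id_] - rnum >= 0 else None)
    for id_, typ, rnum in rows
]
```

For the two sample IDs this reproduces the DAYS_TO_NEXT_PD_ENCOUNTER column shown in the expected output: 5..0 for 30641314, and 3..0 followed by two nulls for 48115882.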