根据带有条件的时间戳获取数据帧中的最新记录 [英] Get latest records in a data frame based on time stamp with condition

查看:74
本文介绍了根据带有条件的时间戳获取数据帧中的最新记录的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的问题标题可能不准确,但我希望我能够解释我的问题 所以我有一个像下面的数据框

My Question heading might not be accurate but i hope i will be able to explain my question So i have a data frame like below

DataPartition_1|^|PartitionYear_1|^|TimeStamp|^|OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber_1|^|FFAction_1
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|5|^|21|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629599|^|4295858941|^|1|^|22|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629600|^|4295858941|^|1|^|23|^|2|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629601|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629624|^|4295858941|^|null|^|35|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1510044629626|^|4295858941|^|null|^|37|^|null|^|D|!|
SelfSourcedPublic|^|2001|^|1510044629596|^|4295858941|^|19|^|5|^|1|^|I|!|
SelfSourcedPublic|^|2001|^|1510044629597|^|4295858941|^|20|^|5|^|2|^|I|!|
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|21|^|5|^|2|^|I|!|

所以我数据框的主键是

OrganizationId", "AnnualPeriodId","InterimPeriodId"

下面是我的代码,该代码根据时间戳获取最新记录,并按主键seq排列.

below is my code to get the latest records based on time stamp and arranged by seq of primary key .

import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc) 
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")

现在我的问题是有些时候我在某些主键列(例如带有时间戳记1510044629624的记录)中为空.

Now my issue is some time i get null in some of the primary key column like record with time stamp 1510044629624.

现在我的要求是,下面的记录具有相同的主键,但第一个记录为null.在这种情况下,我仍然只需要一个具有最新时间戳的记录

Now my requirement is that below records have same primary key except first one has null .In this case i still need only one records with latest time stamp

    SelfSourcedPublic|^|2016|^|1510044629601|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|

我应该得到SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|

由于空..

我希望我的问题很清楚.

I hope my question is clear.

推荐答案

据我所知,您正在使用一个额外的列作为主键.

From what I understood from your question and problem is that you are using an extra column as primary key.

AnnualPeriodId列正在获取null,并且由于您正在使用partitionBy中的该字段,因此导致null成为单独的组,从而导致单独的row

AnnualPeriodId column is getting null and since you are using that field in partitionBy, it is causing the null to be separate group and thus separate row

val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc) 

因此解决方案是将其从partitionBy中删除,以使上面的行变为

So the solution is to remove it from the partitionBy so that the above line becomes

val windowSpec = Window.partitionBy("OrganizationId", "InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc) 

我希望这能解决您遇到的问题.

I hope this should solve the issue you are having.

这篇关于根据带有条件的时间戳获取数据帧中的最新记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆