根据带条件的时间戳获取数据框中的最新记录 [英] Get latest records in a data frame based on time stamp with condition

查看:59
本文介绍了根据带条件的时间戳获取数据框中的最新记录的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的问题标题可能不准确,但我希望我能够解释我的问题所以我有一个像下面这样的数据框

My Question heading might not be accurate but i hope i will be able to explain my question So i have a data frame like below

DataPartition_1|^|PartitionYear_1|^|TimeStamp|^|OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber_1|^|FFAction_1
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|5|^|21|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629599|^|4295858941|^|1|^|22|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629600|^|4295858941|^|1|^|23|^|2|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629601|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629624|^|4295858941|^|null|^|35|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1510044629626|^|4295858941|^|null|^|37|^|null|^|D|!|
SelfSourcedPublic|^|2001|^|1510044629596|^|4295858941|^|19|^|5|^|1|^|I|!|
SelfSourcedPublic|^|2001|^|1510044629597|^|4295858941|^|20|^|5|^|2|^|I|!|
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|21|^|5|^|2|^|I|!|

所以我的数据框的主键是

So the primary key for my data frame is

OrganizationId", "AnnualPeriodId","InterimPeriodId"

下面是我根据时间戳获取最新记录并按主键序列排列的代码.

below is my code to get the latest records based on time stamp and arranged by seq of primary key .

import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc) 
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")

现在我的问题是有一段时间我在一些主键列中得到空值,比如带有时间戳的记录1510044629624.

Now my issue is some time i get null in some of the primary key column like record with time stamp 1510044629624.

现在我的要求是下面的记录有相同的主键,除了第一个有 null .在这种情况下,我仍然只需要一个带有最新时间戳的记录

Now my requirement is that below records have same primary key except first one has null .In this case i still need only one records with latest time stamp

    SelfSourcedPublic|^|2016|^|1510044629601|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|

我应该得到 SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|

由于 null ..

我希望我的问题很清楚.

I hope my question is clear.

推荐答案

根据我从您的问题和问题中了解到的,您正在使用额外的列作为主键.

From what I understood from your question and problem is that you are using an extra column as primary key.

AnnualPeriodId 列正在获取 null 并且由于您在 partitionBy 中使用该字段,它导致 null 成为单独的组,因此单独的 row

AnnualPeriodId column is getting null and since you are using that field in partitionBy, it is causing the null to be separate group and thus separate row

val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc) 

所以解决办法就是把它从partitionBy中去掉,这样上面的那行就变成

So the solution is to remove it from the partitionBy so that the above line becomes

val windowSpec = Window.partitionBy("OrganizationId", "InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc) 

我希望这能解决您遇到的问题.

I hope this should solve the issue you are having.

这篇关于根据带条件的时间戳获取数据框中的最新记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆