在每个累积最大值内填充第一行 [英] Fill down first row within each cumulative max, with a twist

查看:39
本文介绍了在每个累积最大值内填充第一行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有按日期"排序的数据,其中有些值为"x",其累积最大值为"cmx".我想创建一列"max_date",它是每个累积最大值内第一行的日期.稍作改动:当有多个"x"值等于当前的累积最大值时,应为这些行中的每一行选择一个新的最大日期".

I have data ordered by 'date', with some values 'x' and their cumulative maximum 'cmx'. I want to create a column 'max_date' which is the date of the first row within each cumulative max. With a small twist: when there are several 'x' values which are equal to the current cumulative max, a new 'max date' should be selected for each of these rows.

注释了两个不同的cummax的某些数据:

Some data where two different cummax are annotated:

d = structure(list(date = structure(c(18690, 18691, 18692, 18693, 18694, 18695, 18696, 18697), class = "Date"),
                   x = c(18, 70, 57, 94, 94, 13, 98, 23),
                   cmx = c(18, 70, 70, 94, 94, 94, 98, 98)),
                   row.names = c(NA, -8L), class = c("data.table", "data.frame"))
d
#          date  x cmx
# 1: 2021-03-04 18  18
# 2: 2021-03-05 70  70 # first row of cummax 70: select this date for row 2 & 3 
# 3: 2021-03-06 57  70 # 
# 4: 2021-03-07 94  94 # first row of cummax 94
# 5: 2021-03-08 94  94 # x is equal to cummax 94 again! 
# 6: 2021-03-09 13  94 # I.e. row 5 is a 'new' first date to be used for row 5 & 6  
# 7: 2021-03-10 98  98
# 8: 2021-03-11 23  98

因此,理想的结果:

#          date  x cmx   max_date
# 1: 2021-03-04 18  18 2021-03-04 
# 2: 2021-03-05 70  70 2021-03-05
# 3: 2021-03-06 57  70 2021-03-05
# 4: 2021-03-07 94  94 2021-03-07
# 5: 2021-03-08 94  94 2021-03-08
# 6: 2021-03-09 13  94 2021-03-08
# 7: 2021-03-10 98  98 2021-03-10
# 8: 2021-03-11 23  98 2021-03-10 

我以为我可以使用 frollapply ,但是无法获得滚动窗口来查看之前的所有行.

I thought I could use frollapply but couldn't get the rolling window to look at all previous rows.

推荐答案

使用 cumsum(x == cmx)创建可分隔连续行的组,其中"x"等于 cummax(x).在每个组中,将"max_date"设置为第一个"date".

Use cumsum(x == cmx) to create groups which separate consecutive rows where 'x' equals cummax(x). Within each group, set 'max_date' to first 'date'.

d[ , max_date := date[1], by = cumsum(x == cmx)]
d
#          date  x cmx   max_date
# 1: 2021-03-04 18  18 2021-03-04
# 2: 2021-03-05 70  70 2021-03-05
# 3: 2021-03-06 57  70 2021-03-05
# 4: 2021-03-07 94  94 2021-03-07
# 5: 2021-03-08 94  94 2021-03-08
# 6: 2021-03-09 13  94 2021-03-08
# 7: 2021-03-10 98  98 2021-03-10
# 8: 2021-03-11 23  98 2021-03-10


避免使用 by (在某些情况下 可能更快)的第二种选择:


A second alternative which avoids using by (may be faster in some situations):

d[rowid(cumsum(x == cmx)) == 1, max_date := date]
d[ , max_date := nafill(max_date, type = "locf")]

使用 cumsum(x == cmx)创建用于分隔连续行的组,其中"x"等于 cummax(x)(否则, rowid(cmx)就足够了).使用 rowid 在组内创建一个计数器.对于每个组中的第一行( rowid == 1 ),将'max_date'设置为'date'.使用 nafill 和'last-observation-carried-forward'填充 NA 行.

Use cumsum(x == cmx) to create groups which separate consecutive rows where 'x' equals cummax(x) (otherwise rowid(cmx) would have been enough). Use rowid to create a counter within groups. For the first row within each group (rowid == 1), set 'max_date' to 'date'. Use nafill with 'last-observation-carried-forward' to fill NA rows.

这篇关于在每个累积最大值内填充第一行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆