在每个累积最大值内填充第一行 [英] Fill down first row within each cumulative max, with a twist
问题描述
我有按日期"排序的数据,其中有些值为"x",其累积最大值为"cmx".我想创建一列"max_date",它是每个累积最大值内第一行的日期.稍作改动:当有多个"x"值等于当前的累积最大值时,应为这些行中的每一行选择一个新的最大日期".
I have data ordered by 'date', with some values 'x' and their cumulative maximum 'cmx'. I want to create a column 'max_date' which is the date of the first row within each cumulative max. With a small twist: when there are several 'x' values which are equal to the current cumulative max, a new 'max date' should be selected for each of these rows.
注释了两个不同的cummax的某些数据:
Some data where two different cummax are annotated:
d = structure(list(date = structure(c(18690, 18691, 18692, 18693, 18694, 18695, 18696, 18697), class = "Date"),
x = c(18, 70, 57, 94, 94, 13, 98, 23),
cmx = c(18, 70, 70, 94, 94, 94, 98, 98)),
row.names = c(NA, -8L), class = c("data.table", "data.frame"))
d
# date x cmx
# 1: 2021-03-04 18 18
# 2: 2021-03-05 70 70 # first row of cummax 70: select this date for row 2 & 3
# 3: 2021-03-06 57 70 #
# 4: 2021-03-07 94 94 # first row of cummax 94
# 5: 2021-03-08 94 94 # x is equal to cummax 94 again!
# 6: 2021-03-09 13 94 # I.e. row 5 is a 'new' first date to be used for row 5 & 6
# 7: 2021-03-10 98 98
# 8: 2021-03-11 23 98
因此,理想的结果:
# date x cmx max_date
# 1: 2021-03-04 18 18 2021-03-04
# 2: 2021-03-05 70 70 2021-03-05
# 3: 2021-03-06 57 70 2021-03-05
# 4: 2021-03-07 94 94 2021-03-07
# 5: 2021-03-08 94 94 2021-03-08
# 6: 2021-03-09 13 94 2021-03-08
# 7: 2021-03-10 98 98 2021-03-10
# 8: 2021-03-11 23 98 2021-03-10
我以为我可以使用 frollapply
,但是无法获得滚动窗口来查看之前的所有行.
I thought I could use frollapply
but couldn't get the rolling window to look at all previous rows.
推荐答案
使用 cumsum(x == cmx)
创建可分隔连续行的组,其中"x"等于 cummax(x)
.在每个组中,将"max_date"设置为第一个"date".
Use cumsum(x == cmx)
to create groups which separate consecutive rows where 'x' equals cummax(x)
. Within each group, set 'max_date' to first 'date'.
d[ , max_date := date[1], by = cumsum(x == cmx)]
d
# date x cmx max_date
# 1: 2021-03-04 18 18 2021-03-04
# 2: 2021-03-05 70 70 2021-03-05
# 3: 2021-03-06 57 70 2021-03-05
# 4: 2021-03-07 94 94 2021-03-07
# 5: 2021-03-08 94 94 2021-03-08
# 6: 2021-03-09 13 94 2021-03-08
# 7: 2021-03-10 98 98 2021-03-10
# 8: 2021-03-11 23 98 2021-03-10
避免使用 by
(在某些情况下 可能更快)的第二种选择:
A second alternative which avoids using by
(may be faster in some situations):
d[rowid(cumsum(x == cmx)) == 1, max_date := date]
d[ , max_date := nafill(max_date, type = "locf")]
使用 cumsum(x == cmx)
创建用于分隔连续行的组,其中"x"等于 cummax(x)
(否则, rowid(cmx)
就足够了).使用 rowid
在组内创建一个计数器.对于每个组中的第一行( rowid == 1
),将'max_date'设置为'date'.使用 nafill
和'last-observation-carried-forward'填充 NA
行.
Use cumsum(x == cmx)
to create groups which separate consecutive rows where 'x' equals cummax(x)
(otherwise rowid(cmx)
would have been enough). Use rowid
to create a counter within groups. For the first row within each group (rowid == 1
), set 'max_date' to 'date'. Use nafill
with 'last-observation-carried-forward' to fill NA
rows.
这篇关于在每个累积最大值内填充第一行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!