计算行之间的日期差 [英] Calculate difference in date between rows
问题描述
我有一个data.table,我想在其中标记在给定组ID的先前条目的90天内的条目。上下文是这些是交易的购买信号。因此,我不希望在90天的时间范围内重复工作,因为我假设我持有该职位90天,因此本来已经买了一个职位(并且我不想重新启动时钟)。
I have a data.table where I would like to flag entries that are within 90 days of a prior entry for a given group id. Context is that these are buy signals for trades. So I don't want duplicates within a 90 day window, since I am assuming I hold the position for 90 days, and thus would already have bought a position (and I don't want to restart the clock).
所以我有:
library(data.table)
> dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))
> dt
id date
1: A 2017-01-01
2: A 2017-02-01
3: A 2017-05-01
4: B 2017-01-01
5: B 2017-05-01
6: B 2017-10-01
7: C 2017-01-01
8: C 2017-02-01
9: C 2017-02-15
我想得到:
> dt2
id date with.90d
1: A 2017-01-01 FALSE
2: A 2017-02-01 TRUE
3: A 2017-05-01 FALSE
4: B 2017-01-01 FALSE
5: B 2017-05-01 FALSE
6: B 2017-10-01 FALSE
7: C 2017-01-01 FALSE
8: C 2017-02-01 TRUE
9: C 2017-02-15 TRUE
我觉得我应该可以使用.SD做到这一点,但是我不太清楚。谢谢您的帮助!
I feel like I should be able to do this with .SD, but I can't quite figure it out. Thanks for any help!
推荐答案
OP具有:
The OP has requested:
从每个组的第一个观察值X,我想标记距离X不到90天的任何其他观测值。然后对于距离X大于90天的第二天观测值,将其称为观测值Y,我想标记Y的90天内的任何观测值。重复。
From first observation X in each group, I want to flag any other observations less than 90 days away from X. Then for the next day observation that is greater than 90 days away from X, call it observation Y, I want to flag any observations within 90 days of Y. Repeat.
如果我从预期结果中正确理解,则中的
列表示90天周期的开始。 FALSE
值with.90d
If I understood correctly from the expected result, a value of FALSE
in the with.90d
column indicates the beginn of 90 day period.
不幸的是,下一个 90天周期的开始取决于前一个90天的期限到期后下一次观察的日期。因此,我们不能使用从每个组中的第一个日期开始的固定90天间隔。
Unfortunately, the beginn of the next 90 day period depends on the date of the next observation after expiration of the previous 90 day period. Therefore, we cannot use fixed 90 day intervals starting with the first date in each group.
我试图使用 non-equi joins 或滚动连接,但到目前为止我还是采用了递归方法:
I have tried to find solutions using non-equi joins or rolling joins but I ended up so far with a recursive approach:
dt3[, with.90d := NA]
while (dt3[, any(is.na(with.90d))])
dt3[is.na(with.90d), cd := date - min(date), by = id][
is.na(with.90d) & cd == 0, with.90d := FALSE][
is.na(with.90d) & cd <= 90, with.90d := TRUE]
dt3
id date with.90d cd
1: A 2017-01-01 FALSE 0 days
2: A 2017-02-01 TRUE 31 days
3: A 2017-05-01 FALSE 0 days
4: B 2017-01-01 FALSE 0 days
5: B 2017-05-01 FALSE 0 days
6: B 2017-10-01 FALSE 0 days
7: C 2017-01-01 FALSE 0 days
8: C 2017-02-01 TRUE 31 days
9: C 2017-02-15 TRUE 45 days
10: D 2017-03-01 FALSE 0 days
11: D 2017-04-01 TRUE 31 days
12: D 2017-05-01 TRUE 61 days
13: D 2017-06-01 FALSE 0 days
14: D 2017-07-01 TRUE 30 days
15: D 2017-08-01 TRUE 61 days
16: E 2017-01-01 FALSE 0 days
17: E 2017-02-01 TRUE 31 days
18: E 2017-03-01 TRUE 59 days
19: E 2017-04-01 TRUE 90 days
20: E 2017-05-01 FALSE 0 days
21: E 2017-06-01 TRUE 31 days
id date with.90d cd
请注意,我还追加了两个组, D
和 E
到OP的样本数据集中,以便更好地验证该方法。另请注意,组 D
从 2017-03-01
和开始的结果有何不同E
,始于 2017-01-01
。
Note that I have appended two more groups, D
and E
to OP's sample dataset in order to better verify the approach. Also note how the result differs for group D
which starts on 2017-03-01
and E
which starts on 2017-01-01
.
只要在 with.90d
中有 NA
个值,就对重复以下顺序仅 NA
行(行为 TRUE
或 FALSE $ c的行$ c>值已完成)
As long as there are NA
values in with.90d
, the following sequence is repeated for the NA
rows only (rows with TRUE
or FALSE
values are completed):
- 计算与每个组中第一个日期的天差。请注意,使用
min(date)
也可用于无序数据集。或者,setorder(dt3,date)
和first(date)
(或date [1 ]
)。 - 日差为
0
的行表示新的开始时间段,并标记为FALSE
。 - 行距相差小于或等于90天的行标记为
TRUE
。 - 所有其他行均保持不变,即保持
NA
的值。 li>
- Compute the day differences to the first date in each group. Note that
min(date)
is used which works also with an unordered dataset. Alternatively,setorder(dt3, date)
andfirst(date)
(ordate[1]
) can be used. - Rows with a day difference of
0
indicate the start of a new period and are flagged asFALSE
. - Rows with a day difference of less or equal 90 days are flagged as
TRUE
. - All other rows remain untouched, i.e., they keep the
NA
value.
为了便于说明,我将帮助器列保留为 cd
。可以通过 dt3 [,cd:= NULL]
删除。
For illustration I have kept the helper column cd
. It can be removed by dt3[, cd := NULL]
.
# OP's sample dataset
dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))
# append group D
dt2 <- dt[, .(id = c(id, rep("D", 6)),
date = c(date, seq(as.Date("2017-03-01"), length.out = 6, by = "1 month")))]
# append group E
dt3 <- dt2[, .(id = c(id, rep("E", 6)),
date = c(date, seq(as.Date("2017-01-01"), length.out = 6, by = "1 month")))]
这篇关于计算行之间的日期差的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!