计算行之间的日期差 [英] Calculate difference in date between rows

查看:77
本文介绍了计算行之间的日期差的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个data.table,我想在其中标记在给定组ID的先前条目的90天内的条目。上下文是这些是交易的购买信号。因此,我不希望在90天的时间范围内重复工作,因为我假设我持有该职位90天,因此本来已经买了一个职位(并且我不想重新启动时钟)。

I have a data.table where I would like to flag entries that are within 90 days of a prior entry for a given group id. Context is that these are buy signals for trades. So I don't want duplicates within a 90 day window, since I am assuming I hold the position for 90 days, and thus would already have bought a position (and I don't want to restart the clock).

所以我有:

library(data.table)
> dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))
> dt
   id       date
1:  A 2017-01-01
2:  A 2017-02-01
3:  A 2017-05-01
4:  B 2017-01-01
5:  B 2017-05-01
6:  B 2017-10-01
7:  C 2017-01-01
8:  C 2017-02-01
9:  C 2017-02-15

我想得到:

> dt2
   id       date with.90d
1:  A 2017-01-01    FALSE
2:  A 2017-02-01     TRUE
3:  A 2017-05-01    FALSE
4:  B 2017-01-01    FALSE
5:  B 2017-05-01    FALSE
6:  B 2017-10-01    FALSE
7:  C 2017-01-01    FALSE
8:  C 2017-02-01     TRUE
9:  C 2017-02-15     TRUE

我觉得我应该可以使用.SD做到这一点,但是我不太清楚。谢谢您的帮助!

I feel like I should be able to do this with .SD, but I can't quite figure it out. Thanks for any help!

推荐答案

OP具有

The OP has requested:


从每个组的第一个观察值X,我想标记距离X不到90天的任何其他观测值。然后对于距离X大于90天的第二天观测值,将其称为观测值Y,我想标记Y的90天内的任何观测值。重复。

From first observation X in each group, I want to flag any other observations less than 90 days away from X. Then for the next day observation that is greater than 90 days away from X, call it observation Y, I want to flag any observations within 90 days of Y. Repeat.

如果我从预期结果中正确理解,则中的 FALSE 值with.90d 列表示90天周期的开始。

If I understood correctly from the expected result, a value of FALSE in the with.90d column indicates the beginn of 90 day period.

不幸的是,下一个 90天周期的开始取决于前一个90天的期限到期后下一次观察的日期。因此,我们不能使用从每个组中的第一个日期开始的固定90天间隔。

Unfortunately, the beginn of the next 90 day period depends on the date of the next observation after expiration of the previous 90 day period. Therefore, we cannot use fixed 90 day intervals starting with the first date in each group.

我试图使用 non-equi joins 滚动连接,但到目前为止我还是采用了递归方法:

I have tried to find solutions using non-equi joins or rolling joins but I ended up so far with a recursive approach:

dt3[, with.90d := NA]
while (dt3[, any(is.na(with.90d))]) 
  dt3[is.na(with.90d), cd := date - min(date), by = id][
    is.na(with.90d) & cd == 0, with.90d := FALSE][
      is.na(with.90d) & cd <= 90, with.90d := TRUE]
dt3



    id       date with.90d      cd
 1:  A 2017-01-01    FALSE  0 days
 2:  A 2017-02-01     TRUE 31 days
 3:  A 2017-05-01    FALSE  0 days
 4:  B 2017-01-01    FALSE  0 days
 5:  B 2017-05-01    FALSE  0 days
 6:  B 2017-10-01    FALSE  0 days
 7:  C 2017-01-01    FALSE  0 days
 8:  C 2017-02-01     TRUE 31 days
 9:  C 2017-02-15     TRUE 45 days
10:  D 2017-03-01    FALSE  0 days
11:  D 2017-04-01     TRUE 31 days
12:  D 2017-05-01     TRUE 61 days
13:  D 2017-06-01    FALSE  0 days
14:  D 2017-07-01     TRUE 30 days
15:  D 2017-08-01     TRUE 61 days
16:  E 2017-01-01    FALSE  0 days
17:  E 2017-02-01     TRUE 31 days
18:  E 2017-03-01     TRUE 59 days
19:  E 2017-04-01     TRUE 90 days
20:  E 2017-05-01    FALSE  0 days
21:  E 2017-06-01     TRUE 31 days
    id       date with.90d      cd


请注意,我还追加了两个组, D E 到OP的样本数据集中,以便更好地验证该方法。另请注意,组 D 2017-03-01 开始的结果有何不同E ,始于 2017-01-01

Note that I have appended two more groups, D and E to OP's sample dataset in order to better verify the approach. Also note how the result differs for group D which starts on 2017-03-01 and E which starts on 2017-01-01.

只要在 with.90d 中有 NA 个值,就对重复以下顺序仅 NA (行为 TRUE FALSE 值已完成)

As long as there are NA values in with.90d, the following sequence is repeated for the NA rows only (rows with TRUE or FALSE values are completed):


  • 计算与每个组中第一个日期的天差。请注意,使用 min(date)也可用于无序数据集。或者, setorder(dt3,date) first(date)(或 date [1 ] )。

  • 日差为 0 的行表示新的开始时间段,并标记为 FALSE

  • 行距相差小于或等于90天的行标记为 TRUE

  • 所有其他行均保持不变,即保持 NA 的值。
  • li>
  • Compute the day differences to the first date in each group. Note that min(date) is used which works also with an unordered dataset. Alternatively, setorder(dt3, date) and first(date) (or date[1]) can be used.
  • Rows with a day difference of 0 indicate the start of a new period and are flagged as FALSE.
  • Rows with a day difference of less or equal 90 days are flagged as TRUE.
  • All other rows remain untouched, i.e., they keep the NA value.

为了便于说明,我将帮助器列保留为 cd 。可以通过 dt3 [,cd:= NULL] 删除。

For illustration I have kept the helper column cd. It can be removed by dt3[, cd := NULL].

# OP's sample dataset
dt <- data.table(id = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), 
                 date = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01", "2017-01-01", "2017-05-01", "2017-10-01", "2017-01-01", "2017-02-01", "2017-02-15")))
# append group D
dt2 <- dt[, .(id = c(id, rep("D", 6)), 
              date = c(date, seq(as.Date("2017-03-01"), length.out = 6, by = "1 month")))]
# append group E
dt3 <- dt2[, .(id = c(id, rep("E", 6)), 
               date = c(date, seq(as.Date("2017-01-01"), length.out = 6, by = "1 month")))]

这篇关于计算行之间的日期差的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆