Combining observations with overlapping dates

Question

Each observation in my data frame contains a different "before date" and "after date". The problem is that some date ranges overlap within an ID. For instance, in the table below, IDs 1 and 4 contain overlapping date ranges.

ID  before date  after date
1   10/1/1996    12/1/1996
1   1/1/1998     9/30/2003
1   1/1/2000     12/31/2004
2   1/1/2001     3/31/2006
3   1/1/2001     9/30/2006
4   1/1/2001     9/30/2005
4   10/1/2004    12/30/2004
4   10/3/2004    11/28/2004

I am trying to get something like this:

ID  before date  after date
1   10/1/1996    12/1/1996
1   1/1/1998     12/31/2004
2   1/1/2001     3/31/2006
3   1/1/2001     9/30/2006
4   1/1/2001     9/30/2005

Basically, I would like to replace any overlapping date ranges with a single range that spans the overlap, leave the non-overlapping ranges alone, and delete any rows that become unnecessary. I'm not sure how to go about doing this.

Recommended answer

Firstly, you should convert your string dates into Date-classed values, which will make comparison possible. Here's how I've defined and coerced your data:

df <- data.frame(
  ID=c(1,1,1,2,3,4,4,4),
  before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'),
  after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004')
);
dcis <- grep('date$',names(df)); # indexes of the date columns
df[dcis] <- lapply(df[dcis],as.Date,format='%m/%d/%Y');
df;
##   ID before.date after.date
## 1  1  1996-10-01 1996-12-01
## 2  1  1998-01-01 2003-09-30
## 3  1  2000-01-01 2004-12-31
## 4  2  2001-01-01 2006-03-31
## 5  3  2001-01-01 2006-09-30
## 6  4  2001-01-01 2005-09-30
## 7  4  2004-10-01 2004-12-30
## 8  4  2004-10-03 2004-11-28
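
To see why the conversion matters, here is a small sketch (the variable names are illustrative and not part of the answer) that compares two of the raw strings with and without coercion. As character strings they are compared character by character, so the result need not reflect chronological order.

```r
# Two date strings from the example data, in month/day/year form.
a <- '9/30/2003'
b <- '12/31/2004'

# Compared as text, '9...' sorts after '1...', so a appears "later" than b.
as_text <- a > b

# Compared as Date objects, chronological order is respected.
as_dates <- as.Date(a, format = '%m/%d/%Y') > as.Date(b, format = '%m/%d/%Y')

as_text   # TRUE
as_dates  # FALSE
```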

Now, my solution involves computing an "overlapping grouping" vector which I've called og. It makes the assumption that the input df is ordered by ID and then before.date, which it is in your example data. If not, this could be achieved by df[order(df$ID,df$before.date),]. Here's how I compute og:

cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
og <- with(df,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & ave(after.date,ID,FUN=cummax)[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4

Unfortunately, the base R cummax() function doesn't work on Date-classed objects, so I had to write a cummax.Date() shim. I'll explain the need for the ave() and cummax() business at the end of the post.
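
If you would rather not define an S3 method, a plain helper function can do the same job. This is only a minimal sketch: cummax_date is a hypothetical name, not something provided by base R.

```r
# Hypothetical helper (not part of base R): running maximum for Date vectors.
cummax_date <- function(x) {
  # Dates are stored as day counts since 1970-01-01, so take the running
  # maximum of the day counts and convert the result back to Date.
  as.Date(cummax(as.numeric(x)), origin = '1970-01-01')
}

ends <- as.Date(c('2005-09-30', '2004-10-02', '2004-11-28'))
running <- cummax_date(ends)
running
```

Such a helper could be passed to ave() as FUN=cummax_date in place of cummax.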

As you can see, the above computation lags the RHS of each of the two vectorized comparisons by excluding the first element via [-1]. This allows us to compare a record's ID for equality with the following record's ID, and also compare if its after.date is after the before.date of the following record. The resulting logical vectors are ANDed (&) together. The negation of that logical vector then represents adjacent pairs of records that do not overlap, and thus we can cumsum() the result (and prepend zero, as the first record must start with zero) to get our grouping vector.
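
The one-liner is dense, so here is the same logic unrolled into named steps on a made-up three-row example. The vector names (id, start, end, same_id, overlaps, new_group) are illustrative assumptions, and the cumulative-maximum refinement is deliberately left out to keep the sketch short.

```r
# Hypothetical three-row example, already ordered by id and then start.
id    <- c(1, 1, 2)
start <- as.Date(c('2000-01-01', '2000-06-01', '2000-03-01'))
end   <- as.Date(c('2000-12-31', '2000-09-30', '2000-04-30'))

n <- length(id)
same_id   <- id[-n] == id[-1]        # row i and row i+1 share an ID
overlaps  <- end[-n] > start[-1]     # row i ends after row i+1 starts
new_group <- !(same_id & overlaps)   # TRUE where row i+1 starts a new group

# Prepend 0 for the first row, then count group boundaries cumulatively.
og <- c(0, cumsum(new_group))
og
```

Rows 1 and 2 end up in group 0 because they share an ID and overlap; row 3 has a different ID, so it starts group 1.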

Finally, I've used by() to work with each overlapping group independently:

do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=max(g$after.date))));
##   ID before.date after.date
## 0  1  1996-10-01 1996-12-01
## 1  1  1998-01-01 2004-12-31
## 2  2  2001-01-01 2006-03-31
## 3  3  2001-01-01 2006-09-30
## 4  4  2001-01-01 2005-09-30

Since all records in a group must have the same ID, and we've made the assumption that records are ordered by before.date (after being ordered by ID, which is no longer relevant), we can get the correct ID and before.date values from the first record in the group. That's why I started with g[1,]. Then we just need to get the greatest after.date from the group via max(g$after.date), and overwrite the first record's after.date with that, which I've done with transform().
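
The per-group step can also be written with split() and lapply(), which makes it easier to see what happens to each group. This is a sketch on made-up data; g_df, og_demo and collapse_group are hypothetical names.

```r
# Hypothetical input: two overlapping rows for ID 1 and one row for ID 2.
g_df <- data.frame(
  ID = c(1, 1, 2),
  before.date = as.Date(c('1998-01-01', '2000-01-01', '2001-01-01')),
  after.date  = as.Date(c('2003-09-30', '2004-12-31', '2006-03-31'))
)
og_demo <- c(0, 0, 1)  # grouping vector, as produced by the og computation

# Keep the first row of a group and stretch its after.date to the group maximum.
collapse_group <- function(g) {
  first <- g[1, ]
  first$after.date <- max(g$after.date)
  first
}

merged <- do.call(rbind, lapply(split(g_df, og_demo), collapse_group))
merged
```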

A word about performance: The assumption about ordering aids performance, because it allows us to simply compare each record against the immediately following record via lagged vectorized comparisons, rather than comparing every record in a group with every other record.
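
Because everything depends on that ordering, it may be safer to wrap the steps in a function that sorts first. The following is one possible sketch rather than a definitive implementation: merge_overlaps is a hypothetical name, it assumes columns called ID, before.date and after.date of class Date, and it takes the running maximum on the underlying day counts so that no cummax() method for Date is needed.

```r
# Hypothetical wrapper; assumes columns ID, before.date and after.date (Date).
merge_overlaps <- function(d) {
  # Enforce the ordering that the neighbour-to-neighbour comparison relies on.
  d <- d[order(d$ID, d$before.date), ]
  n <- nrow(d)
  if (n < 2) return(d)

  # Running maximum of after.date within each ID, computed on day counts.
  run_max <- ave(as.numeric(d$after.date), d$ID, FUN = cummax)
  start   <- as.numeric(d$before.date)

  same_id  <- d$ID[-n] == d$ID[-1]
  overlaps <- run_max[-n] > start[-1]
  og <- c(0, cumsum(!(same_id & overlaps)))

  # Collapse each overlap group to its first row with the largest after.date.
  out <- do.call(rbind, lapply(split(d, og), function(g) {
    g$after.date[1] <- max(g$after.date)
    g[1, ]
  }))
  rownames(out) <- NULL
  out
}

# Usage on deliberately shuffled, made-up rows.
shuffled <- data.frame(
  ID = c(4, 1, 4, 1, 4),
  before.date = as.Date(c('2004-10-03', '1998-01-01', '2001-01-01',
                          '2000-01-01', '2004-10-01')),
  after.date  = as.Date(c('2004-11-28', '2003-09-30', '2005-09-30',
                          '2004-12-31', '2004-10-02'))
)
res <- merge_overlaps(shuffled)
res
```

The sort costs O(n log n), after which the comparison itself is a single vectorized pass over adjacent rows.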

Now, for the ave() and cummax() business. I realized after writing the initial version of my answer that there was a flaw in my solution, which happens to not be exposed by your example data. Say there are three records in a group. If the first record has a range that overlaps with both of the following two records, and then the middle record does not overlap with the third record, then my (original) code would fail to identify that the third record is part of the same overlapping group of the previous two records.

The solution is to not simply use the after.date of the current record when comparing against the following record, but instead use the cumulative maximum after.date within the group. If any earlier record sprawled completely beyond its immediately following record, then it obviously overlapped with that record, and its after.date is what's important in considering overlapping groups for subsequent records.

Here's a demonstration of input data that requires this fix, using your df as a base:

df2 <- df;
df2[7,'after.date'] <- as.Date('2004-10-02');
df2;
##   ID before.date after.date
## 1  1  1996-10-01 1996-12-01
## 2  1  1998-01-01 2003-09-30
## 3  1  2000-01-01 2004-12-31
## 4  2  2001-01-01 2006-03-31
## 5  3  2001-01-01 2006-09-30
## 6  4  2001-01-01 2005-09-30
## 7  4  2004-10-01 2004-10-02
## 8  4  2004-10-03 2004-11-28

Now record 6 overlaps with both records 7 and 8, but record 7 does not overlap with record 8. The solution still works:

cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
og <- with(df2,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & ave(after.date,ID,FUN=cummax)[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
do.call(rbind,by(df2,og,function(g) transform(g[1,],after.date=max(g$after.date))));
##   ID before.date after.date
## 0  1  1996-10-01 1996-12-01
## 1  1  1998-01-01 2004-12-31
## 2  2  2001-01-01 2006-03-31
## 3  3  2001-01-01 2006-09-30
## 4  4  2001-01-01 2005-09-30

Here's a proof that the og calculation would be wrong without the ave()/cummax() fix:

og <- with(df2,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & after.date[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 5

A minor adjustment to the solution: overwrite after.date with its cumulative maximum before the og computation, which avoids the max() call (this makes more sense if you're planning to overwrite the original df with the new aggregation):

cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
df$after.date <- ave(df$after.date,df$ID,FUN=cummax);
df;
##   ID before.date after.date
## 1  1  1996-10-01 1996-12-01
## 2  1  1998-01-01 2003-09-30
## 3  1  2000-01-01 2004-12-31
## 4  2  2001-01-01 2006-03-31
## 5  3  2001-01-01 2006-09-30
## 6  4  2001-01-01 2005-09-30
## 7  4  2004-10-01 2005-09-30
## 8  4  2004-10-03 2005-09-30
og <- with(df,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & after.date[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
df <- do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=g$after.date[nrow(g)])));
df;
##   ID before.date after.date
## 0  1  1996-10-01 1996-12-01
## 1  1  1998-01-01 2004-12-31
## 2  2  2001-01-01 2006-03-31
## 3  3  2001-01-01 2006-09-30
## 4  4  2001-01-01 2005-09-30
