How to filter rows based on difference in dates between rows in R?

Question

Within each id, I would like to keep rows that are at least 91 days apart. In my dataframe df below, id=1 has 5 rows and id=2 has 1 row.

For id=1, I would like to keep only the 1st, 3rd and 5th rows.

This is because if we compare 1st date and 2nd date, they differ by 32 days. So, remove 2nd date. We proceed to comparing 1st and 3rd date, and they differ by 152 days. So, we keep 3rd date.

Now, instead of using the 1st date as reference, we use the 3rd date. The 3rd and 4th dates differ by 61 days, so remove the 4th date. We proceed to comparing the 3rd and 5th dates, which differ by 547 days with the dates given in df, so we keep the 5th date.

In the end, the dates we keep are 1st, 3rd and 5th dates. As for id=2, there is only one row, so we keep that. The desired result is shown in dfnew.

df <- read.table(header = TRUE, text = "
id  var1  date        
 1  A     2006-01-01 
 1  B     2006-02-02 
 1  C     2006-06-02 
 1  D     2006-08-02 
 1  E     2007-12-01 
 2  F     2007-04-20 
",stringsAsFactors=FALSE)

dfnew <- read.table(header = TRUE, text = "
id  var1  date        
 1  A     2006-01-01 
 1  C     2006-06-02 
 1  E     2007-12-01 
 2  F     2007-04-20 
",stringsAsFactors=FALSE)

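The day counts used in the walkthrough above can be checked directly. The following is a minimal sketch in base R; the vector `d` and the names in `gaps` are made up for this illustration, and the dates are copied from `df` for id = 1.

```r
# Dates for id = 1, copied from df (assumed sorted in ascending order)
d <- as.Date(c("2006-01-01", "2006-02-02", "2006-06-02",
               "2006-08-02", "2007-12-01"))

# Day differences for the four comparisons made in the walkthrough
gaps <- c(
  first_vs_second = as.numeric(difftime(d[2], d[1], units = "days")),
  first_vs_third  = as.numeric(difftime(d[3], d[1], units = "days")),
  third_vs_fourth = as.numeric(difftime(d[4], d[3], units = "days")),
  third_vs_fifth  = as.numeric(difftime(d[5], d[3], units = "days"))
)

gaps        # the day differences
gaps >= 91  # TRUE where the later row would be kept
```

With the dates as listed in `df`, the four comparisons come out as 32, 152, 61 and 547 days, so only the 3rd and 5th rows clear the 91-day threshold relative to their reference row.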

I can only think of starting with grouping the df by id as follows:

library(dplyr)
dfnew <- df %>% group_by(id)

However, I am not sure of how to continue from here. Should I proceed with filter function or slice? If so, how?

Recommended answer

An alternative that uses slice from dplyr is to define the following recursive function:

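library(dplyr)

# Return the indices of the rows to keep, starting from index `ind`.
# `d` is a vector of dates, assumed sorted in ascending order.
f <- function(d, ind = 1) {
  # First index whose date is more than 90 days after d[ind]; NA if there is none
  ind.next <- first(which(difftime(d, d[ind], units = "days") > 90))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}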

This function operates on the date column, starting at ind = 1. It finds ind.next, the first index whose date is more than 90 days (that is, at least 91 days) after the date indexed by ind. If there is no such index, ind.next is NA and we just return ind. Otherwise, we recursively call f starting at ind.next and return its result concatenated with ind. The end result of this function call is the vector of row indices whose dates are separated by at least 91 days.

With this function, we can do:

result <- df %>% group_by(id) %>% slice(f(as.Date(date, format="%Y-%m-%d")))
##Source: local data frame [4 x 3]
##Groups: id [2]
##
##     id  var1       date
##  <int> <chr>      <chr>
##1     1     A 2006-01-01
##2     1     C 2006-06-02
##3     1     E 2007-12-01
##4     2     F 2007-04-20

The use of this function assumes that the date column is sorted in ascending order by each id group. If not, we can just sort the dates before slicing. Not sure about the efficiency of this or the dangers of recursive calls in R. Hopefully, David Arenburg or others can comment on this.
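
For anyone worried about recursion depth on long groups, or about unsorted input, here is one possible sketch and not part of the original answer. `keep_spaced` is a hypothetical loop-based counterpart of `f` (the name and the `gap` parameter are made up here), and `arrange(id, date)` is used to sort before slicing. The input rows are deliberately shuffled to show that the sort step matters.

```r
library(dplyr)

# Hypothetical loop-based counterpart of f(): returns the indices of dates
# that are each more than `gap` days after the previously kept date.
# Assumes `d` is a Date vector sorted in ascending order.
keep_spaced <- function(d, gap = 90) {
  keep <- integer(0)
  ref <- NULL  # date of the most recently kept row
  for (i in seq_along(d)) {
    if (is.null(ref) ||
        as.numeric(difftime(d[i], ref, units = "days")) > gap) {
      keep <- c(keep, i)
      ref <- d[i]
    }
  }
  keep
}

# Same rows as df, but in shuffled order
df_shuffled <- read.table(header = TRUE, text = "
id  var1  date
 1  E     2007-12-01
 1  B     2006-02-02
 2  F     2007-04-20
 1  A     2006-01-01
 1  D     2006-08-02
 1  C     2006-06-02
", stringsAsFactors = FALSE)

result <- df_shuffled %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
  arrange(id, date) %>%          # sort so the function's assumption holds
  group_by(id) %>%
  slice(keep_spaced(date)) %>%
  ungroup()

result$var1
```

On this data the kept rows are A, C, E and F, the same as dfnew. The loop visits each date once, so it avoids nested calls, but it has not been benchmarked against the recursive version.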

As suggested by David Arenburg, it is better to convert date to a Date class first instead of by group:

result <- df %>% mutate(date=as.Date(date, format="%Y-%m-%d")) %>%
                 group_by(id) %>% slice(f(date))
##Source: local data frame [4 x 3]
##Groups: id [2]
##
##     id  var1       date
##  <int> <chr>     <date>
##1     1     A 2006-01-01
##2     1     C 2006-06-02
##3     1     E 2007-12-01
##4     2     F 2007-04-20
