How to filter rows based on the difference in dates between rows in R?
Problem description
Within each id, I would like to keep rows that are at least 91 days apart. In my data frame df below, id=1 has 5 rows and id=2 has 1 row.
For id=1, I would like to keep only the 1st, 3rd and 5th rows.
This is because the 1st and 2nd dates differ by 32 days, so the 2nd date is removed. We then compare the 1st and 3rd dates, which differ by 152 days, so the 3rd date is kept.
Now, instead of the 1st date, the 3rd date becomes the reference. The 3rd and 4th dates differ by 61 days, so the 4th date is removed. We then compare the 3rd and 5th dates, which differ by 547 days, so the 5th date is kept.
In the end, the dates we keep are the 1st, 3rd and 5th. As for id=2, there is only one row, so we keep it. The desired result is shown in dfnew.
df <- read.table(header = TRUE, text = "
id var1 date
1 A 2006-01-01
1 B 2006-02-02
1 C 2006-06-02
1 D 2006-08-02
1 E 2007-12-01
2 F 2007-04-20
",stringsAsFactors=FALSE)
dfnew <- read.table(header = TRUE, text = "
id var1 date
1 A 2006-01-01
1 C 2006-06-02
1 E 2007-12-01
2 F 2007-04-20
",stringsAsFactors=FALSE)
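The day counts used in the walk-through can be checked directly. The following is a minimal sketch in base R; the variable names d, gap_from_first and gap_from_third are made up for illustration, and the values in the comments come from the dates exactly as they appear in df.

```r
# Dates for id = 1, copied from df above
d <- as.Date(c("2006-01-01", "2006-02-02", "2006-06-02",
               "2006-08-02", "2007-12-01"))

# Days from the 1st date to the 2nd and 3rd dates
# (subtracting two Date objects gives a difference in days)
gap_from_first <- as.numeric(d[2:3] - d[1])
print(gap_from_first)  # 32 152

# Days from the 3rd date (the new reference) to the 4th and 5th dates
gap_from_third <- as.numeric(d[4:5] - d[3])
print(gap_from_third)  # 61 547
```

Only gaps of at least 91 days lead to a kept row, which is why the 3rd and 5th dates survive.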
I can only think of starting by grouping df by id, as follows:
library(dplyr)
dfnew <- df %>% group_by(id)
However, I am not sure how to continue from here. Should I proceed with the filter function or with slice? If so, how?
Recommended answer
An alternative that uses slice from dplyr is to define the following recursive function:
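library(dplyr)

# Return the indices of d (dates sorted in ascending order) to keep,
# starting from index ind
f <- function(d, ind = 1) {
  # first() gives NA when no date is more than 90 days after d[ind]
  ind.next <- first(which(difftime(d, d[ind], units = "days") > 90))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}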
This function operates on the date column, starting at ind = 1. It finds ind.next, the first index whose date is more than 90 days (that is, at least 91 days) after the date indexed by ind. If there is no such index, ind.next is NA and we just return ind. Otherwise, we recursively call f starting at ind.next and return its result concatenated with ind. The end result of the call is the vector of row indices in which each kept row is at least 91 days after the previously kept one.
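To make the recursion concrete, here is a rough trace of the three steps on the dates of id=1, using only base R. Indexing with [1] stands in for dplyr's first in this sketch, and the names step1, step2, step3 and kept are made up for illustration.

```r
# Dates for id = 1 from the question, already sorted in ascending order
d <- as.Date(c("2006-01-01", "2006-02-02", "2006-06-02",
               "2006-08-02", "2007-12-01"))

# Step 1: reference is index 1; indices 3, 4 and 5 are more than 90 days later
step1 <- which(difftime(d, d[1], units = "days") > 90)
print(step1)     # 3 4 5, so ind.next is 3

# Step 2: reference is index 3; only index 5 is more than 90 days later
step2 <- which(difftime(d, d[3], units = "days") > 90)
print(step2)     # 5, so ind.next is 5

# Step 3: reference is index 5; nothing qualifies, so which() is empty
step3 <- which(difftime(d, d[5], units = "days") > 90)
print(step3[1])  # NA, which ends the recursion

# Indices collected along the way
kept <- c(1, step1[1], step2[1])
print(kept)      # 1 3 5
```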
With this function, we can do:
result <- df %>% group_by(id) %>% slice(f(as.Date(date, format="%Y-%m-%d")))
##Source: local data frame [4 x 3]
##Groups: id [2]
##
## id var1 date
## <int> <chr> <chr>
##1 1 A 2006-01-01
##2 1 C 2006-06-02
##3 1 E 2007-12-01
##4 2 F 2007-04-20
This function assumes that the date column is sorted in ascending order within each id group. If it is not, we can sort the dates before slicing. I am not sure about the efficiency of this approach or the dangers of recursive calls in R. Hopefully, David Arenburg or others can comment on this.
As suggested by David Arenburg, it is better to convert date to the Date class once up front rather than within each group:
result <- df %>% mutate(date=as.Date(date, format="%Y-%m-%d")) %>%
group_by(id) %>% slice(f(date))
##Source: local data frame [4 x 3]
##Groups: id [2]
##
## id var1 date
## <int> <chr> <date>
##1 1 A 2006-01-01
##2 1 C 2006-06-02
##3 1 E 2007-12-01
##4 2 F 2007-04-20
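Since the answer is unsure about recursion and relies on sorted dates, one possible variation is a loop-based helper combined with an explicit sort. This is only a sketch and not part of the original answer: the names keep_spaced and min_gap are hypothetical, and it assumes the date column has no missing values.

```r
library(dplyr)

# Hypothetical helper (not from the original answer): iterative counterpart of f.
# Assumes d is a Date vector sorted in ascending order with no NA values.
keep_spaced <- function(d, min_gap = 91) {
  if (length(d) == 0) return(integer(0))
  keep <- 1L   # the first row is always kept
  ref <- d[1]  # date of the most recently kept row
  for (i in seq_along(d)[-1]) {
    if (as.numeric(difftime(d[i], ref, units = "days")) >= min_gap) {
      keep <- c(keep, i)
      ref <- d[i]
    }
  }
  keep
}

df <- read.table(header = TRUE, text = "
id var1 date
1 A 2006-01-01
1 B 2006-02-02
1 C 2006-06-02
1 D 2006-08-02
1 E 2007-12-01
2 F 2007-04-20
", stringsAsFactors = FALSE)

result <- df %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
  arrange(id, date) %>%  # sort so the helper's ordering assumption holds
  group_by(id) %>%
  slice(keep_spaced(date)) %>%
  ungroup()
```

On the sample data this keeps the same rows as the recursive version (var1 equal to A, C, E and F). The loop avoids deep call stacks for groups with many rows, though growing keep with c() is not the fastest option for very large groups.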