R: Counting dates within time intervals


Problem description

Assume we have the following input data:

df.in <- data.frame(event = c(1,2,3,4,5), 
                    start = c("2015-01-01", "2015-01-01", "2015-01-02",
                              "2015-01-02", "2015-01-03"),
                    end = c("2015-01-03", "2015-01-04", "2015-01-03",
                            "2015-01-05", "2015-01-05"))
df.in$start <- as.Date(df.in$start, "%Y-%m-%d")
df.in$end <- as.Date(df.in$end, "%Y-%m-%d")

> df.in
  event      start        end
1     1 2015-01-01 2015-01-03
2     2 2015-01-01 2015-01-04
3     3 2015-01-02 2015-01-03
4     4 2015-01-02 2015-01-05
5     5 2015-01-03 2015-01-05

The goal is to count, for each date, how many events are active on it (start date included, end date excluded), and to fill in this data frame:

df.out <- data.frame(date = c("2015-01-01", "2015-01-02", "2015-01-03", 
                              "2015-01-04", "2015-01-05"),
                     count = 0)
df.out$date <- as.Date(df.out$date, "%Y-%m-%d")
> df.out
        date count
1 2015-01-01     0
2 2015-01-02     0
3 2015-01-03     0
4 2015-01-04     0
5 2015-01-05     0

Conceptually, it would look something like this (one row per date, one star per active event):

#1 **
#2 ****
#3 ***
#4 **
#5 

So, my current idea is a loop:

for(i in seq_along(df.out$date)){
  temp.df <- df.in[df.in$start <= df.out$date[i],]
  df.out$count[i] <- nrow(temp.df) - nrow(temp.df[temp.df$end <= df.out$date[i],])
}
> df.out
        date count
1 2015-01-01     2
2 2015-01-02     4
3 2015-01-03     3
4 2015-01-04     2
5 2015-01-05     0

It works, but I am afraid that the temp.df I create on every iteration could snowball into something very large, given that the number of events can easily reach tens or even hundreds of thousands.

So my question is: is there a more efficient way? Perhaps a date package such as lubridate would let me vectorize the whole thing?

Solution

So I've done my research on data.table::foverlaps(). I'll leave my findings here for whoever might find them useful, as I honestly didn't find these details when searching similar posts.

foverlaps() compares intervals, but in this case only the y argument (df.in) has an interval, so we have to create one artificially for df.out, for example with df.out$date2 <- df.out$date. Also, there is no straightforward way (or I couldn't find one) to make interval endpoints inclusive or exclusive. Since we want to exclude the endpoint df.in$end, we have to do it manually on the data table itself with a simple df.in$end <- df.in$end - 1.
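
To see why subtracting one day works, here is a minimal base R sketch (the variable names are made up for illustration): for whole-day Date values, the half-open interval [start, end) contains exactly the same dates as the closed interval [start, end - 1].

```r
# Illustrative values only; any whole-day Date interval behaves the same way.
start <- as.Date("2015-01-01")
end   <- as.Date("2015-01-03")
dates <- seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by = "day")

# Half-open test: start included, end excluded.
in_half_open <- dates >= start & dates < end

# Closed test after shifting the end back by one day.
in_closed <- dates >= start & dates <= end - 1

# Both tests select the same dates (2015-01-01 and 2015-01-02).
stopifnot(identical(in_half_open, in_closed))
```

This equivalence relies on the data having day resolution; with date-times, a one-day shift would not be equivalent.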

Long story short, here is a working example:

require(data.table)
df.out <- data.table(date = c("2015-01-01", "2015-01-02", "2015-01-03", 
                              "2015-01-04", "2015-01-05"),
                     count = 0)
df.out$date <- as.Date(df.out$date, "%Y-%m-%d")

df.in <- data.table(event = c(1,2,3,4,5), 
                    start = c("2015-01-01", "2015-01-01", "2015-01-02",
                              "2015-01-02", "2015-01-03"),
                    end = c("2015-01-03", "2015-01-04", "2015-01-03",
                            "2015-01-05", "2015-01-05"))
df.in$start <- as.Date(df.in$start, "%Y-%m-%d")
df.in$end <- as.Date(df.in$end, "%Y-%m-%d") - 1

setkey(df.in, start, end)
df.out$date2 <- df.out$date
df.test <- foverlaps(x = df.out, y = df.in, type = "within", by.x = c("date", "date2"), by.y = c("start", "end"))
df.test$count[!is.na(df.test$event)] <- 1
aggregate(count ~ date, data = df.test, sum)

        date count
1 2015-01-01     2
2 2015-01-02     4
3 2015-01-03     3
4 2015-01-04     2
5 2015-01-05     0
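
For comparison, the same counts can also be sketched in base R without any package. This is not from the answer above; it is a vectorized version of the question's loop that relies on findInterval(), which for a sorted vector returns how many of its elements are less than or equal to each query value. The names starts, ends, dates and counts are made up for the sketch, and the end dates are the original, unshifted ones.

```r
starts <- as.Date(c("2015-01-01", "2015-01-01", "2015-01-02",
                    "2015-01-02", "2015-01-03"))
ends   <- as.Date(c("2015-01-03", "2015-01-04", "2015-01-03",
                    "2015-01-05", "2015-01-05"))
dates  <- seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by = "day")

# Number of events that have started on or before each date.
n_started <- findInterval(as.numeric(dates), sort(as.numeric(starts)))

# Number of events already over on each date (the end date is exclusive,
# so an event with end <= date no longer counts).
n_ended <- findInterval(as.numeric(dates), sort(as.numeric(ends)))

# Active events per date = started minus ended.
counts <- data.frame(date = dates, count = n_started - n_ended)
print(counts)
```

The work here is two sorts and two binary-search passes, so no per-date subset such as temp.df is ever built.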


Alternatively, you could do

Data

df.out <- data.table(date = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03", 
                              "2015-01-04", "2015-01-05")))

df.in <- data.table(event = 1:5, 
                    start = as.Date(c("2015-01-01", "2015-01-01", "2015-01-02",
                              "2015-01-02", "2015-01-03")),
                    end = as.Date(c("2015-01-03", "2015-01-04", "2015-01-03",
                            "2015-01-05", "2015-01-05")))

Solution

df.out[, `:=`(start = date, end = date)]
df.in[, end := end - 1L]
setkey(df.out, start, end)
foverlaps(df.in, df.out)[, .(count = .N), by = date]
#          date count
# 1: 2015-01-01     2
# 2: 2015-01-02     4
# 3: 2015-01-03     3
# 4: 2015-01-04     2

Or, if you want to update df.out in place (which also keeps dates that have no events, such as 2015-01-05, with a count of 0), you could also do

res <- foverlaps(df.in, df.out, which = TRUE)[, .N, by = yid]
df.out[res$yid, Count := res$N]
df.out[is.na(Count), Count := 0L]
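
As a further option that is not part of the answers above, recent versions of data.table support non-equi joins, which can express the half-open condition directly, so the end - 1 adjustment is not needed. A minimal sketch, assuming the original (unshifted) end dates; ev, days and res are names made up here:

```r
library(data.table)

ev <- data.table(event = 1:5,
                 start = as.Date(c("2015-01-01", "2015-01-01", "2015-01-02",
                                   "2015-01-02", "2015-01-03")),
                 end   = as.Date(c("2015-01-03", "2015-01-04", "2015-01-03",
                                   "2015-01-05", "2015-01-05")))
days <- data.table(date = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                                    "2015-01-04", "2015-01-05")))

# For each row of days, count the rows of ev with start <= date and end > date.
res <- ev[days, on = .(start <= date, end > date), .N, by = .EACHI]

# Rows of res follow the row order of days, so the counts can be attached directly.
days[, count := res$N]
print(days)
```

In the by = .EACHI result the join columns are named start and end but hold the date values from days, which is why the counts are copied back into days rather than using res as is.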
