R: Counting dates within time intervals

Views: 159  Published: 2017/4/8 20:05:47  Tags: r date

Problem description

Assume we have data input:

df.in <- data.frame(event = c(1,2,3,4,5), 
                    start = c("2015-01-01", "2015-01-01", "2015-01-02",
                              "2015-01-02", "2015-01-03"),
                    end = c("2015-01-03", "2015-01-04", "2015-01-03",
                            "2015-01-05", "2015-01-05"))
df.in$start <- as.Date(df.in$start, "%Y-%m-%d")
df.in$end <- as.Date(df.in$end, "%Y-%m-%d")

> df.in
  event      start        end
1     1 2015-01-01 2015-01-03
2     2 2015-01-01 2015-01-04
3     3 2015-01-02 2015-01-03
4     4 2015-01-02 2015-01-05
5     5 2015-01-03 2015-01-05

The goal is to count, for each date, how many events are active on it (start date included, end date excluded), and to fill out this data frame with the result:

df.out <- data.frame(date = c("2015-01-01", "2015-01-02", "2015-01-03", 
                              "2015-01-04", "2015-01-05"),
                     count = 0)
df.out$date <- as.Date(df.out$date, "%Y-%m-%d")
> df.out
        date count
1 2015-01-01     0
2 2015-01-02     0
3 2015-01-03     0
4 2015-01-04     0
5 2015-01-05     0

Conceptually it would look something like this:

#1 **
#2 ****
#3 ***
#4 **
#5

So, my current idea is a loop:

for(i in seq_along(df.out$date)){
  temp.df <- df.in[df.in$start <= df.out$date[i],]
  df.out$count[i] <- nrow(temp.df) - nrow(temp.df[temp.df$end <= df.out$date[i],])
}
> df.out
        date count
1 2015-01-01     2
2 2015-01-02     4
3 2015-01-03     3
4 2015-01-04     2
5 2015-01-05     0

It works, but I am worried that the temp.df I build on every iteration could snowball into something very large, given that the number of events can easily reach tens or even hundreds of thousands.

So my question is - can there be a more efficient way? Perhaps by using some date packages such as lubridate where I can somehow vectorize the whole thing?

Solution

So I did some research on data.table::foverlaps(). I'll leave my findings here for whoever might find them useful, as I honestly didn't come across these small details when searching similar posts.

foverlaps() compares intervals with intervals, but here only the y argument (df.in in this particular case) has one, so we have to make an artificial interval for the dates, for example with df.out$date2 <- df.out$date. Also, there is no straightforward way (or I couldn't find one) to mark an interval endpoint as included or excluded. Since we want to exclude the endpoint df.in$end, we have to do it manually on the data table itself with a plain df.in$end <- df.in$end - 1.
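
To see why subtracting one day is enough, here is a minimal base-R sketch. It assumes the columns hold whole-day Date values (it would not carry over to date-times with a time-of-day part), and the single interval below is only an illustration. It checks that the half-open test "start <= d < end" selects the same days as the closed test "start <= d <= end - 1".

```r
# One example event, active from 2015-01-01 up to but not including 2015-01-03
start <- as.Date("2015-01-01")
end   <- as.Date("2015-01-03")

# Candidate dates to test against the interval
dates <- seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by = "day")

# Half-open interval: start included, end excluded
half_open <- dates >= start & dates < end

# Closed interval after shifting the end back by one day
closed_shifted <- dates >= start & dates <= end - 1

# Both tests pick exactly the same dates
identical(half_open, closed_shifted)
```

Because foverlaps() is used here with both endpoints counted as part of the interval, shifting end by one day gives the exclusive-end behaviour the question asks for.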

Long story short, here is a working example:

require(data.table)

df.out <- data.table(date = c("2015-01-01", "2015-01-02", "2015-01-03",
                              "2015-01-04", "2015-01-05"),
                     count = 0)
df.out$date <- as.Date(df.out$date, "%Y-%m-%d")

df.in <- data.table(event = c(1,2,3,4,5),
                    start = c("2015-01-01", "2015-01-01", "2015-01-02",
                              "2015-01-02", "2015-01-03"),
                    end = c("2015-01-03", "2015-01-04", "2015-01-03",
                            "2015-01-05", "2015-01-05"))
df.in$start <- as.Date(df.in$start, "%Y-%m-%d")
df.in$end <- as.Date(df.in$end, "%Y-%m-%d") - 1

setkey(df.in, start, end)
df.out$date2 <- df.out$date
df.test <- foverlaps(x = df.out, y = df.in, type = "within",
                     by.x = c("date", "date2"), by.y = c("start", "end"))
df.test$count[!is.na(df.test$event)] <- 1
aggregate(count ~ date, data = df.test, sum)

        date count
1 2015-01-01     2
2 2015-01-02     4
3 2015-01-03     3
4 2015-01-04     2
5 2015-01-05     0

Alternatively, you could do

Data

df.out <- data.table(date = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                                      "2015-01-04", "2015-01-05")))

df.in <- data.table(event = 1:5,
                    start = as.Date(c("2015-01-01", "2015-01-01", "2015-01-02",
                                      "2015-01-02", "2015-01-03")),
                    end = as.Date(c("2015-01-03", "2015-01-04", "2015-01-03",
                                    "2015-01-05", "2015-01-05")))

Solution

df.out[, `:=`(start = date, end = date)]
df.in[, end := end - 1L]
setkey(df.out, start, end)
foverlaps(df.in, df.out)[, .(count = .N), by = date]
#          date count
# 1: 2015-01-01     2
# 2: 2015-01-02     4
# 3: 2015-01-03     3
# 4: 2015-01-04     2

Or, if you want to update df.out, you could also do

res <- foverlaps(df.in, df.out, which = TRUE)[, .N, by = yid]
df.out[res$yid, Count := res$N]
df.out[is.na(Count), Count := 0L]
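
For comparison, the same counts can be sketched in base R without building any per-date subset. This is not one of the answers above, only a hedged alternative. It relies on the identity the question's loop already uses: the events active on a date are those that have started by that date minus those that have already ended. It assumes every event has start <= end, and count_active is a hypothetical helper name.

```r
# Same toy data as in the question, as plain data frames
df.in <- data.frame(
  event = 1:5,
  start = as.Date(c("2015-01-01", "2015-01-01", "2015-01-02",
                    "2015-01-02", "2015-01-03")),
  end   = as.Date(c("2015-01-03", "2015-01-04", "2015-01-03",
                    "2015-01-05", "2015-01-05"))
)
df.out <- data.frame(
  date = seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by = "day")
)

# Hypothetical helper: number of events with start <= d < end for each d.
# findInterval(d, v) on a sorted v gives how many elements of v are <= d.
count_active <- function(dates, start, end) {
  d <- as.numeric(dates)
  started <- findInterval(d, sort(as.numeric(start)))  # events with start <= d
  ended   <- findInterval(d, sort(as.numeric(end)))    # events with end <= d
  started - ended
}

df.out$count <- count_active(df.out$date, df.in$start, df.in$end)
df.out$count
# 2 4 3 2 0
```

The cost is two sorts plus one lookup per date, so no intermediate table grows with the number of events. Dates with no active events come out as 0 without a separate fill-in step.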
