Finding Time Difference Between Observations in R


Problem description

I'm trying to determine the time difference between two observations. The data is broken up by individual, and each individual has a unique ID. I have a dataset which tells me what their status updates to every time it changes, and at what time the status changed. Status can be one of two values, and it always changes to the value it currently is not (in this case, from Y to N, or from N to Y).

The data looks like this:

ID Status Time
1    Y     2013-07-01 08:07:00      
2    Y     2013-07-01 08:07:03  
3    Y     2013-07-01 08:07:04      
4    Y     2013-07-01 08:07:06      
1    N     2013-07-01 08:07:07      
2    N     2013-07-01 08:07:23      
5    Y     2013-07-01 08:07:34  
6    Y     2013-07-01 08:07:45  
7    Y     2013-07-01 08:07:47  
1    Y     2013-07-01 08:07:56  
3    N     2013-07-01 08:07:58  

What I would like to find is the amount of time that passes between each status change for each individual ID, that is, how long it takes to get from Y to N. I would then like to get summary statistics such as the distribution of elapsed times, the mean elapsed time, and so on.

So an example output might look like this, recording the three Y-to-N switches which occurred above (IDs 1, 2, and 3 each switched):

Y to N change    Time elapsed (in seconds)
1                     7 
2                     20
3                     54

I'm having a lot of trouble with this for some reason. Right now I have the time in POSIXlt format, and the ID and status as factors. I have tried using ddply to sort the data by ID and then by timestamp, but this hasn't worked so far. Any advice would be much appreciated!
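
For the sorting step on its own, base R's `order()` may be enough, without ddply. The following is a minimal sketch, assuming a data frame called `df` (a hypothetical name, built here from a few of the sample rows) and storing the timestamps as POSIXct, which is generally the easier date-time class to keep in a data frame column:

```r
# Hypothetical sample mirroring a few rows of the data above
df <- data.frame(
  ID     = c(1, 2, 3, 1, 2),
  Status = factor(c("Y", "Y", "Y", "N", "N")),
  Time   = as.POSIXct(c("2013-07-01 08:07:00", "2013-07-01 08:07:03",
                        "2013-07-01 08:07:04", "2013-07-01 08:07:07",
                        "2013-07-01 08:07:23"), tz = "UTC")
)

# Sort by ID first, then by time within each ID
df_sorted <- df[order(df$ID, df$Time), ]
rownames(df_sorted) <- NULL
df_sorted
```

After this, the rows for each ID sit next to each other in chronological order, which is what any row-to-next-row comparison relies on.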

Edit: changed the time to actually be in the correct type.

Edit 2: I ended up writing a solution while waiting for more answers. My way is much uglier than many of the solutions here, but I did:

# Assumes df is already sorted by ID and then by Time.
N <- ifelse(df$Status == "N", 1, 0)
Y <- ifelse(df$Status == "Y", 1, 0)

# making a vector which is 1 for a row if the status of the row below it is N
var1 <- N
for (i in seq_len(nrow(df))) {
  var1[i] <- N[i + 1]  # NA for the last row, which has no row below it
}

# making a vector which is TRUE if a row's status is Y and the row after is N
check <- Y == 1 & var1 == 1
# had to define the last one as FALSE manually because the last row has no row below it
check[nrow(df)] <- FALSE

# made a loop which finds the time difference between a row's Time and the row below it,
# given that "check" is TRUE for that row, and writes that to a results vector.
# Note: this does not check that both rows belong to the same ID.
# here is the results vector
results <- numeric(nrow(df))
# here is the for loop
for (i in seq_len(nrow(df))) {
  if (check[i]) {
    # later time first, so the elapsed time is positive; always in seconds
    results[i] <- difftime(df$Time[i + 1], df$Time[i], units = "secs")
  }
}

I originally had this solved with a for loop, but over the ~1 million rows of my actual dataset it was far too slow, so I did this vectorization instead. Would these other solutions work on data that large? I will definitely be trying them out!
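
On the question of scale, one loop-free option is to compare each row with the row below it using shifted index vectors. The following is a minimal base R sketch, not the poster's code: it rebuilds the sample data from the question as `df`, and the helper names `cur`, `nxt`, `is_switch` and `elapsed` are made up for illustration.

```r
# Sample data from the question
df <- data.frame(
  ID     = c(1, 2, 3, 4, 1, 2, 5, 6, 7, 1, 3),
  Status = c("Y", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "Y", "N"),
  Time   = as.POSIXct(c("2013-07-01 08:07:00", "2013-07-01 08:07:03",
                        "2013-07-01 08:07:04", "2013-07-01 08:07:06",
                        "2013-07-01 08:07:07", "2013-07-01 08:07:23",
                        "2013-07-01 08:07:34", "2013-07-01 08:07:45",
                        "2013-07-01 08:07:47", "2013-07-01 08:07:56",
                        "2013-07-01 08:07:58"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Put each ID's rows together in chronological order
df <- df[order(df$ID, df$Time), ]

cur <- seq_len(nrow(df) - 1)  # every row except the last
nxt <- cur + 1                # the row directly below each of those

# A Y-to-N switch: same ID, current row is Y, next row is N
is_switch <- df$ID[cur] == df$ID[nxt] &
  df$Status[cur] == "Y" & df$Status[nxt] == "N"

elapsed <- data.frame(
  ID   = df$ID[cur][is_switch],
  secs = as.numeric(difftime(df$Time[nxt][is_switch],
                             df$Time[cur][is_switch], units = "secs"))
)
elapsed
#   ID secs
# 1  1    7
# 2  2   20
# 3  3   54
```

Every step here operates on whole vectors, so the cost should be dominated by the sort rather than by row-by-row iteration. Passing `units = "secs"` keeps `difftime()` from picking a different unit automatically.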

Recommended answer

Here is another approach. I tried to keep all of the data in the final output. Please note that, for demonstration purposes, I modified your data a bit. In my code, I first arrange the data by ID and Time. I then change Status (i.e., Y and N) to 1 and 0 in order to create group, which tells us when Status changed: if the same number continues for a few rows, Status has not changed. I then calculate the time difference (i.e., gap) for each ID. Finally, I set gap to NA for every row that is not the first row of its group; that is, I make the unnecessary gaps NA. Please note that the first observation for each ID has NA in gap as well. gap is in seconds.
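
To make the group step concrete, here is a small standalone sketch of the `cumsum(diff(...) != 0)` idiom on a made-up status vector (`status` is an assumption for illustration, not part of the data). The counter goes up by one each time a value differs from the one before it, so consecutive equal values share a group number.

```r
# Made-up 0/1 status sequence for a single ID
status <- c(1, 0, 1, 1, 0)

# TRUE wherever the value differs from the previous one; the first row always starts a group
changed <- c(TRUE, diff(status) != 0)

# Running count of changes gives a group number per row
group <- cumsum(changed)
group
# 1 2 3 3 4
```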

ann <- data.frame(ID = c(1,2,3,4,1,2,2,1,1,1,3),
                  Status = c("Y", "Y", "Y", "Y",
                             "N", "N", "Y", "Y", "Y", "N", "N"),
                  Time = c("2013-07-01 08:07:00", "2013-07-01 08:07:03",
                           "2013-07-01 08:07:04", "2013-07-01 08:07:06",
                           "2013-07-01 08:07:07", "2013-07-01 08:07:23",
                           "2013-07-01 08:07:34", "2013-07-01 08:07:45",
                           "2013-07-01 08:07:47", "2013-07-01 08:07:56",
                           "2013-07-01 08:07:58"),
                  stringsAsFactors = FALSE)

ann$Time <- as.POSIXct(ann$Time)

#   ID Status                Time
#1   1      Y 2013-07-01 08:07:00
#2   2      Y 2013-07-01 08:07:03
#3   3      Y 2013-07-01 08:07:04
#4   4      Y 2013-07-01 08:07:06
#5   1      N 2013-07-01 08:07:07
#6   2      N 2013-07-01 08:07:23
#7   2      Y 2013-07-01 08:07:34
#8   1      Y 2013-07-01 08:07:45
#9   1      Y 2013-07-01 08:07:47
#10  1      N 2013-07-01 08:07:56
#11  3      N 2013-07-01 08:07:58

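library(dplyr)

ann %>%
    arrange(ID, Time) %>%
    group_by(ID) %>%
    mutate(Status = ifelse(Status == "Y", 1, 0),
           # group increases by one each time Status changes within an ID
           group = cumsum(c(TRUE, diff(Status) != 0)),
           # elapsed seconds since the previous row of the same ID
           gap = as.numeric(difftime(Time, lag(Time), units = "secs"))) %>%
    group_by(ID, group) %>%
    # keep gap only on the first row of each run of equal Status
    mutate(gap = ifelse(row_number() != 1, NA, gap))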

#   ID Status                Time group gap
#1   1      1 2013-07-01 08:07:00     1  NA
#2   1      0 2013-07-01 08:07:07     2   7
#3   1      1 2013-07-01 08:07:45     3  38
#4   1      1 2013-07-01 08:07:47     3  NA
#5   1      0 2013-07-01 08:07:56     4   9
#6   2      1 2013-07-01 08:07:03     1  NA
#7   2      0 2013-07-01 08:07:23     2  20
#8   2      1 2013-07-01 08:07:34     3  11
#9   3      1 2013-07-01 08:07:04     1  NA
#10  3      0 2013-07-01 08:07:58     2  54
#11  4      1 2013-07-01 08:07:06     1  NA
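
From a result like this, the summary statistics asked for in the question can be obtained by keeping only the rows where the status has just become N. A minimal sketch, assuming the result has been saved to a data frame; here a hypothetical `res` is rebuilt by hand from the printed output above so that the block runs on its own:

```r
# Hypothetical copy of the relevant columns of the printed result
res <- data.frame(
  ID     = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
  Status = c(1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1),
  gap    = c(NA, 7, 38, NA, 9, NA, 20, 11, NA, 54, NA)
)

# Rows with Status == 0 and a non-missing gap mark a Y-to-N switch
y_to_n <- res$gap[res$Status == 0 & !is.na(res$gap)]
y_to_n
# 7 9 20 54

mean(y_to_n)
# 22.5

summary(y_to_n)  # min, quartiles, mean, max of the elapsed times
```

Note that gap is measured from the immediately preceding row, so for ID 1 the value 9 counts from the second Y at 08:07:47, not from the first Y of that run at 08:07:45.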
