相差至少 30 分钟时间的子集观测 [英] Subset observations that differ by at least 30 minutes time

查看:12
本文介绍了相差至少 30 分钟时间的子集观测的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个 data.table(约 3000 万行),由 POSIXct 格式的 datetime 列和一个 id 组成 列和其他一些列(在示例中,我只留下了一个不相关的列 x 来证明还有其他列需要保留).dput 在帖子的底部.

I have a data.table (~30 million rows) consisting of a datetime column in POSIXct format, an id column and a few other columns (in the example, I just left one irrelevant column x to demonstrate that there are other columns present that need to be kept). A dput is at the bottom of the post.

head(DT)
#              datetime          x id
#1: 2016-04-28 16:20:18 0.02461368  1
#2: 2016-04-28 16:41:34 0.88953932  1
#3: 2016-04-28 16:46:07 0.31818101  1
#4: 2016-04-28 17:00:56 0.14711365  1
#5: 2016-04-28 17:09:11 0.54406602  1
#6: 2016-04-28 17:39:09 0.69280341  1

问:对于每个 id,我只需要对相差超过 30 分钟的观察结果进行子集化.什么是有效的 data.table 方法来执行此操作(如果可能,无需大量循环)?

Q: For each id, I need to subset only those observations that differ by more than 30 minutes time. What could be an efficient data.table approach to do this (if possible, without extensive looping)?

逻辑也可以描述为(就像我在下面的评论中一样):

The logic can also be described as (like in my comment below):

每个 id 始终保留第一行.至少为 30 的下一行第一个之后的分钟也应保留.让我们假设该行保留第 4 行.然后,计算第 4 行和第 4 行之间的时间差第 5:n 行并保留第一个相差超过 30 分钟的行,依此类推开

Per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on

在下面的 dput 中,我添加了一个列 keep 来指示在此示例中应保留哪些行,因为它们与之前每个 id 保留的观察值相差超过 30 分钟.困难在于似乎有必要迭代地计算时间差(或者至少,我目前想不出更有效的方法).

In the dput below, I added a colum keep to indicate which rows should be kept in this example because they differ by more than 30 minutes from the previous observation that is kept per id. The difficulty is that it seems to be necessary to calculate the time differences iteratively (or at least, I cannot think of a more efficient approach at the moment).

library(data.table)
DT <- structure(list(
  datetime = structure(c(1461853218.81561, 1461854494.81561, 
    1461854767.81561, 1461855656.81561, 1461856151.81561, 1461857949.81561, 
    1461858601.81561, 1461858706.81561, 1461859078.81561, 1461859103.81561, 
    1461852799.81561, 1461852824.81561, 1461854204.81561, 1461855331.81561, 
    1461855633.81561, 1461856311.81561, 1461856454.81561, 1461857177.81561, 
    1461858662.81561, 1461858996.81561), class = c("POSIXct", "POSIXt")), 
  x = c(0.0246136845089495, 0.889539316063747, 0.318181007634848, 
  0.147113647311926, 0.544066024711356, 0.6928034061566, 0.994269776623696, 
  0.477795971091837, 0.231625785352662, 0.963024232536554, 0.216407935833558, 
  0.708530468167737, 0.758459537522867, 0.640506813768297, 0.902299045119435, 
  0.28915973729454, 0.795467417687178, 0.690705278422683, 0.59414202044718, 
  0.655705799115822), 
  id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
  keep = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, 
           FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)), 
  .Names = c("datetime", "x", "id", "keep"), 
  row.names = c(NA, -20L), 
  class = c("data.table", "data.frame"))

setkey(DT, id, datetime)
DT[, difftime := difftime(datetime, shift(datetime, 1L, NA,type="lag"), units = "mins"),
   by = id]
DT[is.na(difftime), difftime := 0]
DT[, difftime := cumsum(as.numeric(difftime)), by = id]

keep栏说明:

  • 第 2:3 行与第 1 行相差不到 30 分钟 -> 删除
  • 第 4 行与第 1 行相差超过 30 分钟 -> 保留
  • 第 5 行距离第 4 行不到 30 分钟 -> 删除
  • 第 6 行与第 4 行相差超过 30 分钟 -> 保持
  • ...

期望的输出:

desiredDT <- DT[(keep)]

<小时>

感谢我收到的三个专家答复.我在 1 和 1000 万行数据上测试了它们.这是基准测试的摘录.


Thanks for three expert answers I received. I tested them on 1 and 10 million rows of data. Here's an excerpt of the benchmarks.

a) 100 万行

microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2), 
               times = 3L, unit = "relative")
#Unit: relative
#              expr       min        lq      mean    median        uq      max neval
#   frank(DT_Frank)  1.286647  1.277104  1.185216  1.267769  1.140614 1.036749     3
# roland(DT_Roland)  1.000000  1.000000  1.000000  1.000000  1.000000 1.000000     3
#   eddi1(DT_Eddi1) 11.748622 11.697409 10.941792 11.647320 10.587002 9.720901     3
#   eddi2(DT_Eddi2)  9.966078  9.915651  9.210168  9.866330  8.877769 8.070281     3

b) 1000 万行

microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2), 
                times = 3L, unit = "relative")
#Unit: relative
#              expr       min        lq      mean    median        uq       max neval
#   frank(DT_Frank)  1.019561  1.025427  1.026681  1.031061  1.030028  1.029037     3
# roland(DT_Roland)  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000     3
#   eddi1(DT_Eddi1) 11.567302 11.443146 11.301487 11.323914 11.176515 11.035143     3
#   eddi2(DT_Eddi2)  9.796800  9.693823  9.526193  9.594931  9.398969  9.211019     3

显然,@Frank 的 data.table 方法和 @Roland 的基于 Rcpp 的解决方案在性能上相似,但 Rcpp 略有优势,而 @eddi 的方法仍然很快,但性能不如其他方法.

Apparently, @Frank's data.table approach and @Roland's Rcpp based solution are similar in performance with Rcpp having a slight advantage, while @eddi's approaches were still fast but not as performant as the others.

但是,当我检查解决方案的相等性时,我发现@Roland 的方法与其他方法的结果略有不同:

However, when I checked for equality of the solutions, I found that @Roland's approach has a slightly different result than the others:

a) 100 万行

all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component "datetime": Numeric: lengths (982228, 982224) differ"
#[2] "Component "id": Numeric: lengths (982228, 982224) differ"      
#[3] "Component "x": Numeric: lengths (982228, 982224) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE

b) 1000 万行

all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component "datetime": Numeric: lengths (9981898, 9981891) differ"
#[2] "Component "id": Numeric: lengths (9981898, 9981891) differ"      
#[3] "Component "x": Numeric: lengths (9981898, 9981891) differ"       
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE

我目前的假设是,这种差异可能与差异是 > 30 分钟还是 >= 30 分钟有关,尽管我还不确定.

My current assumption is that this difference might be related to whether the differnce is > 30 minutes or >= 30 minutes though I'm not sure about that yet.

最后的想法:我决定使用@Frank 的解决方案有两个原因:1. 它性能非常好,几乎等于 Rcpp 解决方案,以及 2. 它不需要我不太熟悉的另一个包然而(无论如何我都在使用 data.table)

Final thought: I decided to go with @Frank's solution for two reasons: 1. it performs very well, almost equal to the Rcpp solution, and 2. it doesn't require another package with which I'm not very familiar yet (I'm using data.table anyway)

推荐答案

我会这样做:

setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data

s = 0L
w = DT[, .I[1L], by=id]$V1

while (length(w)){
   s = s + 1L
   DT[w, tag := s]

   m = DT[w, .(id, datetime = datetime+30*60)]
   w = DT[m, which = TRUE, roll=-Inf]
   w = w[!is.na(w)]
}

给了

               datetime          x id  keep tag
 1: 2016-04-28 10:20:18 0.02461368  1  TRUE   1
 2: 2016-04-28 10:41:34 0.88953932  1 FALSE  NA
 3: 2016-04-28 10:46:07 0.31818101  1 FALSE  NA
 4: 2016-04-28 11:00:56 0.14711365  1  TRUE   2
 5: 2016-04-28 11:09:11 0.54406602  1 FALSE  NA
 6: 2016-04-28 11:39:09 0.69280341  1  TRUE   3
 7: 2016-04-28 11:50:01 0.99426978  1 FALSE  NA
 8: 2016-04-28 11:51:46 0.47779597  1 FALSE  NA
 9: 2016-04-28 11:57:58 0.23162579  1 FALSE  NA
10: 2016-04-28 11:58:23 0.96302423  1 FALSE  NA
11: 2016-04-28 10:13:19 0.21640794  2  TRUE   1
12: 2016-04-28 10:13:44 0.70853047  2 FALSE  NA
13: 2016-04-28 10:36:44 0.75845954  2 FALSE  NA
14: 2016-04-28 10:55:31 0.64050681  2  TRUE   2
15: 2016-04-28 11:00:33 0.90229905  2 FALSE  NA
16: 2016-04-28 11:11:51 0.28915974  2 FALSE  NA
17: 2016-04-28 11:14:14 0.79546742  2 FALSE  NA
18: 2016-04-28 11:26:17 0.69070528  2  TRUE   3
19: 2016-04-28 11:51:02 0.59414202  2 FALSE  NA
20: 2016-04-28 11:56:36 0.65570580  2  TRUE   4

OP 描述了其背后的想法 在评论中:

The idea behind it is described by the OP in a comment:

每个 id 始终保留第一行.在第一行之后至少 30 分钟的下一行也应保留.假设要保留的行是第 4 行.然后,计算第 4 行和第 5:n 行之间的时间差,并保留相差超过 30 分钟的第一个,依此类推

per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on

这篇关于相差至少 30 分钟时间的子集观测的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆