Count number of rows in window around timestamp


Problem Description



I have a data.table containing a time series in POSIXct format and some features (see example below). I want to add a column containing the number of rows within a window of w seconds around the timestamp.

So essentially what I need is this...

dt[, COUNT := 0]
for(i in seq(nrow(dt))) {
    interval_start <- dt[i,(TIMESTAMP - w)]
    interval_end <- dt[i,(TIMESTAMP + w)]
    dt[i, "COUNT"] <- nrow(dt[TIMESTAMP > interval_start & TIMESTAMP < interval_end])
}

... which is obviously very slow and usually not the way to do this. I played around a lot with lapply(), .SD, custom functions passed to lapply(), and rolling joins, but did not get it to work.

I did not think that something apparently this easy would be my first post here.

              TIMESTAMP   FEAT1          FEAT2 COUNT (w = 2000 secs)
 1: 2017-11-16 02:50:19 2332843           1282     2
 2: 2017-11-16 03:01:38   40913            129     2
 3: 2017-11-16 08:07:03 2758077          15281     1
 4: 2017-11-16 09:33:31  156899            448     1
 5: 2017-11-16 11:00:04 3639410           1901     2
 6: 2017-11-16 11:01:50   46274            242     2
 7: 2017-11-16 12:00:46 3336248           1975     1
 8: 2017-11-16 16:31:16 3262457           1006     2
 9: 2017-11-16 16:37:33 3110064            840     2
10: 2017-11-16 17:23:06 3059651            765     3
11: 2017-11-16 17:31:57   51569            143     3
12: 2017-11-16 17:35:17 2254123            899     3
13: 2017-11-16 18:32:47 2321751           1182     3
14: 2017-11-16 19:02:11  469452           1257     3
15: 2017-11-16 19:02:23 2248207            923     3
16: 2017-11-16 19:40:25   62245            150     1
17: 2017-11-16 20:54:38 2245512            936     2
18: 2017-11-16 21:26:35 4191734           1044     3
19: 2017-11-16 21:30:08 2202018            958     2
20: 2017-11-16 23:42:37 2434562           2559     1

UPDATE
For now I'm using @Gautam's approach with some tweaks for more speed. Some experiments showed that this is much faster than the loop (20k rows: 35 vs 7 secs, 100k rows: 408 vs 175 secs).

tme <- dt$TIMESTAMP   # the time column in the example data is TIMESTAMP
lower <- tme - w
upper <- tme + w
dt$count <- sapply(1:length(tme), function(z) return(sum(tme > lower[z] & tme < upper[z])))
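
As an aside (an editorial addition, not part of the original question): because the example data is already sorted by TIMESTAMP, the same strictly-within-window counts can also be computed without any per-row scan using base R's findInterval(). The snippet below is only a sketch under that sorted-order assumption.

# Sketch, assuming dt is sorted by TIMESTAMP (as in the example data).
# Counts rows with TIMESTAMP strictly between t - w and t + w, the row itself included.
ts <- as.numeric(dt$TIMESTAMP)
dt$count <- findInterval(ts + w, ts, left.open = TRUE) -   # rows with TIMESTAMP <  t + w
            findInterval(ts - w, ts)                        # minus rows with TIMESTAMP <= t - w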

Still, if anybody knows the "data.table way" of doing this, I would love to know it.

Solution

This can also be solved using a non-equi join, which also appears to be much faster (see the benchmark below):

w <- 2000
DT[, COUNT := DT[.(lb = TIMESTAMP - w, ub = TIMESTAMP + w), 
                 on = .(TIMESTAMP > lb, TIMESTAMP < ub), by = .EACHI, .N]$N][]
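
To unpack what this one-liner does, here is the same logic spelled out step by step (a decomposed sketch added for clarity, not additional code from the answer):

# 1. Build one lookup row per original row: the window [lb, ub] around its TIMESTAMP.
intervals <- DT[, .(lb = TIMESTAMP - w, ub = TIMESTAMP + w)]

# 2. Non-equi join: for each interval, find the rows of DT with TIMESTAMP > lb and
#    TIMESTAMP < ub; by = .EACHI groups the result per row of `intervals`, and .N
#    counts the matches within each group.
counts <- DT[intervals, on = .(TIMESTAMP > lb, TIMESTAMP < ub), by = .EACHI, .N]

# 3. The counts are returned in the row order of `intervals`, i.e. of DT, so they
#    can be assigned back directly.
DT[, COUNT := counts$N]

The one-liner above does exactly this, building the interval table inline with .() in the i argument of the inner DT[...] call.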

              TIMESTAMP   FEAT1 FEAT2 COUNT
 1: 2017-11-16 02:50:19 2332843  1282     2
 2: 2017-11-16 03:01:38   40913   129     2
 3: 2017-11-16 08:07:03 2758077 15281     1
 4: 2017-11-16 09:33:31  156899   448     1
 5: 2017-11-16 11:00:04 3639410  1901     2
 6: 2017-11-16 11:01:50   46274   242     2
 7: 2017-11-16 12:00:46 3336248  1975     1
 8: 2017-11-16 16:31:16 3262457  1006     2
 9: 2017-11-16 16:37:33 3110064   840     2
10: 2017-11-16 17:23:06 3059651   765     3
11: 2017-11-16 17:31:57   51569   143     3
12: 2017-11-16 17:35:17 2254123   899     3
13: 2017-11-16 18:32:47 2321751  1182     3
14: 2017-11-16 19:02:11  469452  1257     3
15: 2017-11-16 19:02:23 2248207   923     3
16: 2017-11-16 19:40:25   62245   150     1
17: 2017-11-16 20:54:38 2245512   936     2
18: 2017-11-16 21:26:35 4191734  1044     3
19: 2017-11-16 21:30:08 2202018   958     2
20: 2017-11-16 23:42:37 2434562  2559     1

Data

DT <- readr::read_table(
"              TIMESTAMP   FEAT1          FEAT2 COUNT
 1: 2017-11-16 02:50:19 2332843           1282     2
 2: 2017-11-16 03:01:38   40913            129     2
 3: 2017-11-16 08:07:03 2758077          15281     1
 4: 2017-11-16 09:33:31  156899            448     1
 5: 2017-11-16 11:00:04 3639410           1901     2
 6: 2017-11-16 11:01:50   46274            242     2
 7: 2017-11-16 12:00:46 3336248           1975     1
 8: 2017-11-16 16:31:16 3262457           1006     2
 9: 2017-11-16 16:37:33 3110064            840     2
10: 2017-11-16 17:23:06 3059651            765     3
11: 2017-11-16 17:31:57   51569            143     3
12: 2017-11-16 17:35:17 2254123            899     3
13: 2017-11-16 18:32:47 2321751           1182     3
14: 2017-11-16 19:02:11  469452           1257     3
15: 2017-11-16 19:02:23 2248207            923     3
16: 2017-11-16 19:40:25   62245            150     1
17: 2017-11-16 20:54:38 2245512            936     2
18: 2017-11-16 21:26:35 4191734           1044     3
19: 2017-11-16 21:30:08 2202018            958     2
20: 2017-11-16 23:42:37 2434562           2559     1")
setDT(DT)[, c("X1", "COUNT") := NULL][]  # drop the row-number column and the expected COUNT column

Benchmark

# create data
w <- 2000
nr <- 2e3L
set.seed(123)
DT0 <- data.table(TIMESTAMP = sort(as.POSIXct("2017-11-16") + sample(w * nr, nr)),
                  FEAT1 = sample(1e6L, nr, TRUE),
                  FEAT2 = sample(1e4L, nr, TRUE))
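
Before timing, a quick sanity check (an editorial addition, not part of the original benchmark) can confirm that the sapply approach and the non-equi join produce identical counts on the synthetic data:

# Sanity check on DT0: both methods should yield the same integer vector.
chk <- copy(DT0)
tme <- chk$TIMESTAMP
cnt_sapply <- sapply(1:length(tme), function(z) sum(tme > tme[z] - w & tme < tme[z] + w))
cnt_nej <- chk[.(lb = tme - w, ub = tme + w),
               on = .(TIMESTAMP > lb, TIMESTAMP < ub), by = .EACHI, .N]$N
identical(cnt_sapply, cnt_nej)   # expected: TRUE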

library(microbenchmark)
bm <- microbenchmark(
  gautam = {
    dt <- copy(DT0)
    tme <- dt$TIMESTAMP  
    lower <- tme - w  
    upper <- tme + w  
    dt$count <- sapply(1:length(tme), function(z) return(sum(tme > lower[z] & tme < upper[z]))) 
  },
  cpak = {
    dt <- copy(DT0)
    dt$count <- apply(abs(outer(dt$TIMESTAMP, dt$TIMESTAMP, "-")), 2, function(i) sum(i < w))
  },
  nej = {
    DT <- copy(DT0)
    DT[, COUNT := DT[.(lb = TIMESTAMP - w, ub = TIMESTAMP + w), on = .(TIMESTAMP > lb, TIMESTAMP < ub), by = .EACHI, .N]$N][]
  },
  times = 11L
)

print(bm)

For 2 K rows, the timings for Gautam's and CPak's approaches are about 0.3 seconds, which is in line with the OP's observations. However, the non-equi join is 75 times faster.

Unit: milliseconds
   expr        min         lq       mean     median        uq        max neval
 gautam 269.222847 271.422632 278.196025 273.433669 284.92651 296.377010    11
   cpak 269.657768 271.262771 294.905138 273.239062 275.24474 413.822440    11
    nej   3.465766   3.539532   3.620397   3.589308   3.63731   3.901027    11

This speed advantage increases with problem size. If the number of rows is doubled to 4 K rows, the timings of CPak's approach "explode" to over 4 seconds, because outer() materializes the full n x n matrix of pairwise time differences, so both run time and memory grow quadratically. This method is therefore excluded from further benchmarks.

Unit: milliseconds
   expr        min          lq        mean      median          uq        max neval
 gautam  608.54028  616.835039  620.353603  625.129801  626.260266  627.39073     3
   cpak 3901.07978 4044.341991 4168.566908 4187.604199 4302.310471 4417.01674     3
    nej    5.43377    6.423977    8.372348    7.414183    9.841636   12.26909     3

For 10 K rows, the non-equi join is about 200 times faster than Gautam's approach:

Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 gautam 1914.750641 1921.824429 1947.393708 1928.414685 1945.286100 2048.810492    11
    nej    9.142423    9.234898    9.382176    9.467979    9.507603    9.577044    11

If the number of rows is doubled (20 K rows), the processing time of Gautam's approach nearly triples to 5.4 sec which is in line with the timings reported by the OP. However, the non-equi join approach is now 300 times faster:

Unit: milliseconds
   expr       min         lq       mean     median        uq        max neval
 gautam 5369.1104 5389.17756 5406.53040 5409.24468 5425.2404 5441.23607     3
    nej   17.2523   17.25648   17.27597   17.26066   17.2878   17.31495     3

I didn't have the patience to test the 1 M row case.
