R - rolling window over data.table


Question

I have the following data.table:

          time       id type   price      size  api start.point  end.point
 1: 1399672906 37119594  ASK 440.002 1.4840000 TRUE  1399672606 1399672906
 2: 1399672940 37119597  BID 441.000 0.1758830 TRUE  1399672640 1399672940
 3: 1399672940 37119598  BID 441.000 0.0491166 TRUE  1399672640 1399672940
 4: 1399673105 37119638  ASK 440.002 0.1313700 TRUE  1399672805 1399673105
 5: 1399673198 37119668  BID 441.000 0.0233013 TRUE  1399672898 1399673198
 6: 1399673198 37119669  BID 441.000 0.9744230 TRUE  1399672898 1399673198
 7: 1399673208 37119675  BID 441.000 0.1587060 TRUE  1399672908 1399673208
 8: 1399673208 37119676  BID 441.000 0.1238870 TRUE  1399672908 1399673208
 9: 1399673208 37119677  BID 441.001 0.0100000 TRUE  1399672908 1399673208
10: 1399673208 37119678  BID 441.175 0.0129740 TRUE  1399672908 1399673208
11: 1399673208 37119679  BID 441.192 0.0100000 TRUE  1399672908 1399673208
12: 1399673208 37119680  BID 441.399 0.0129740 TRUE  1399672908 1399673208
13: 1399673208 37119681  BID 441.499 1.7500000 TRUE  1399672908 1399673208
14: 1399673208 37119682  BID 441.500 8.0214600 TRUE  1399672908 1399673208
15: 1399673241 37119691  BID 441.500 0.0453001 TRUE  1399672941 1399673241
16: 1399673274 37119696  ASK 440.030 0.9133460 TRUE  1399672974 1399673274
17: 1399673360 37119705  BID 440.030 0.0580000 TRUE  1399673060 1399673360
18: 1399673433 37119709  ASK 440.002 0.0319611 TRUE  1399673133 1399673433
19: 1399673506 37119711  ASK 440.002 0.2618460 TRUE  1399673206 1399673506
20: 1399673507 37119712  BID 440.002 1.0000000 TRUE  1399673207 1399673507

Where:

  • time is a Unix timestamp
  • id is the trade number assigned by the exchange
  • start.point = time minus 5 minutes (300 seconds)
  • end.point = equal to the variable time

The series is not equidistant. The variables start.point and end.point define a 5-minute moving window that ends at time, and I want to calculate the frequency of trades (the number of trades) within each window.

I have done this with a for loop:

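# Progress bar used below (assumed to have been created along these lines)
status.bar <- txtProgressBar(min = 0, max = nrow(trades), style = 3)

for (i in seq_len(nrow(trades))) {
  # Count the distinct trade ids whose time falls inside row i's window
  trades[i, freq := length(unique(trades[time >= start.point[i] & time <= end.point[i]]$id))]
  setTxtProgressBar(status.bar, i)
}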

However, I'm wondering whether there is a more idiomatic data.table way. I tried something like:

trades[, freq := list(length(unique(trades[time >= start.point & time <= end.point,]$id))), by = list(id)]

But the results are wrong; it doesn't seem to work on a row-by-row basis:

           time       id type   price       size  api start.point  end.point freq
  1: 1399672906 37119594  ASK 440.002  1.4840000 TRUE  1399672606 1399672906  100
  2: 1399672940 37119597  BID 441.000  0.1758830 TRUE  1399672640 1399672940  100
  3: 1399672940 37119598  BID 441.000  0.0491166 TRUE  1399672640 1399672940  100
  4: 1399673105 37119638  ASK 440.002  0.1313700 TRUE  1399672805 1399673105  100
  5: 1399673198 37119668  BID 441.000  0.0233013 TRUE  1399672898 1399673198  100
  6: 1399673198 37119669  BID 441.000  0.9744230 TRUE  1399672898 1399673198  100
  7: 1399673208 37119675  BID 441.000  0.1587060 TRUE  1399672908 1399673208  100
  8: 1399673208 37119676  BID 441.000  0.1238870 TRUE  1399672908 1399673208  100
  9: 1399673208 37119677  BID 441.001  0.0100000 TRUE  1399672908 1399673208  100
 10: 1399673208 37119678  BID 441.175  0.0129740 TRUE  1399672908 1399673208  100
 11: 1399673208 37119679  BID 441.192  0.0100000 TRUE  1399672908 1399673208  100
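A likely explanation for the identical counts: inside the inner trades[...], the names time, start.point and end.point resolve to the full columns of trades rather than to the current group's values. Each row is therefore compared with its own window, the condition is TRUE everywhere, and the whole table is selected for every group. The sketch below reproduces this on a hypothetical three-row table (toy and its values are made up for illustration) and shows that capturing the group's bounds in local variables first restores the per-row behaviour.

```r
library(data.table)

# Hypothetical toy table: each window is [time - 300, time]
toy <- data.table(id = 1:3, time = c(100L, 250L, 900L))
toy[, `:=`(start.point = time - 300L, end.point = time)]

# Same shape as the attempt above: the inner subset compares full columns,
# so every row passes the filter and each group sees the whole table.
toy[, freq := length(unique(toy[time >= start.point & time <= end.point]$id)), by = id]
naive_freq <- copy(toy$freq)   # 3 3 3: the table size, not a per-window count

# Capture the current group's bounds in local variables (lo, hi) so the
# inner subset compares every time against this one window.
toy[, freq := {
  lo <- start.point
  hi <- end.point
  length(unique(toy[time >= lo & time <= hi]$id))
}, by = id]
window_freq <- copy(toy$freq)  # 1 2 1
```

This still runs one subset per group, so it is only meant to illustrate the scoping issue, not to be faster than the for loop.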

UPDATE:

Please see the structure below:

structure(list(time = c(1399672906L, 1399673105L, 1399673274L, 
1399673433L, 1399673506L, 1399673531L), id = c(37119594L, 37119638L, 
37119696L, 37119709L, 37119711L, 37119717L), type = c("ASK", 
"ASK", "ASK", "ASK", "ASK", "ASK"), price = c(440.002, 440.002, 
440.03, 440.002, 440.002, 440), size = c(1.484, 0.13137, 0.913346, 
0.0319611, 0.261846, 3.168), api = c(TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE), start.point = c(1399672606, 1399672805, 1399672974, 
1399673133, 1399673206, 1399673231), end.point = c(1399672906L, 
1399673105L, 1399673274L, 1399673433L, 1399673506L, 1399673531L
), freq = c(1L, 4L, 13L, 14L, 13L, 11L)), .Names = c("time", 
"id", "type", "price", "size", "api", "start.point", "end.point", 
"freq"), sorted = c("type", "time"), class = c("data.table", 
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000002e50788>)
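The dput output above cannot be pasted back into R as is, because the .internal.selfref pointer is not valid R syntax. A minimal sketch for rebuilding the same six rows follows. It assumes the windows are always time minus 300 seconds, as described in the question, and it leaves out the freq column, since those values were presumably computed on the full data set rather than on this sample.

```r
library(data.table)

# Rebuild the six sample rows from the values in the dput output
trades <- data.table(
  time  = c(1399672906L, 1399673105L, 1399673274L,
            1399673433L, 1399673506L, 1399673531L),
  id    = c(37119594L, 37119638L, 37119696L,
            37119709L, 37119711L, 37119717L),
  type  = "ASK",
  price = c(440.002, 440.002, 440.03, 440.002, 440.002, 440),
  size  = c(1.484, 0.13137, 0.913346, 0.0319611, 0.261846, 3.168),
  api   = TRUE
)

# 5-minute window ending at each trade's own timestamp
trades[, `:=`(start.point = time - 300L, end.point = time)]

# The dput output records the table as keyed by type and time
setkey(trades, type, time)
```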


Recommended answer

I think this is best accomplished with the Bioconductor package IRanges for now, until interval joins / range joins are implemented in data.table.

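library(data.table)
library(IRanges)

# Each trade as a width-1 range (a single point in time)
ir1 = IRanges(trades$time, width=1L)
# Each row's 5-minute window [start.point, end.point]
ir2 = IRanges(trades$start.point, trades$end.point)

# Pair every trade (query) with each window (subject) that contains it
olaps = findOverlaps(ir1, ir2, type = "within")
# Count the trades per window; V2 is the window's row index
dt = data.table(queryHits(olaps), subjectHits(olaps))[, .N, by=V2]

# Write the counts back by row index
trades[dt$V2, freq := dt$N]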

#          time       id type   price      size  api start.point  end.point freq
# 1: 1399672906 37119594  ASK 440.002 1.4840000 TRUE  1399672606 1399672906    1
# 2: 1399673105 37119638  ASK 440.002 0.1313700 TRUE  1399672805 1399673105    2
# 3: 1399673274 37119696  ASK 440.030 0.9133460 TRUE  1399672974 1399673274    2
# 4: 1399673433 37119709  ASK 440.002 0.0319611 TRUE  1399673133 1399673433    2
# 5: 1399673506 37119711  ASK 440.002 0.2618460 TRUE  1399673206 1399673506    3
# 6: 1399673531 37119717  ASK 440.000 3.1680000 TRUE  1399673231 1399673531    4
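Later data.table releases (1.9.8 and newer, to my knowledge) do support non-equi joins, so the same counts can probably be obtained without IRanges. The sketch below is an alternative under that assumption, not part of the original answer; it rebuilds only the time and id columns of the sample and introduces counts as an illustrative helper name.

```r
library(data.table)

# Time and id columns of the six-row sample, with the 5-minute windows
trades <- data.table(
  time = c(1399672906L, 1399673105L, 1399673274L,
           1399673433L, 1399673506L, 1399673531L),
  id   = c(37119594L, 37119638L, 37119696L,
           37119709L, 37119711L, 37119717L)
)
trades[, `:=`(start.point = time - 300L, end.point = time)]

# Self non-equi join: for every row of the inner trades (the windows),
# match all rows whose time lies in [start.point, end.point] and count
# the matches per window with by = .EACHI.
counts <- trades[trades,
                 on = .(time >= start.point, time <= end.point),
                 .(freq = .N),
                 by = .EACHI]

# by = .EACHI returns one row per window, in the original row order
trades[, freq := counts$freq]
```

On the sample this should reproduce the freq column shown above (1, 2, 2, 2, 3, 4). Like the IRanges version, it counts matching rows, which equals the number of distinct ids only as long as id is unique per row.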

HTH
