R data.table:基于行的条件拆分/应用/组合 [英] R data.table: row-based conditions split/apply/combine
问题描述
我有以下 data.table
initial.date <- as.POSIXct('2018-10-27 10:00:00',tz='GMT')
last.date <- as.POSIXct('2018-12-28 17:00:00',tz='GMT')
PriorityDateTime=seq.POSIXt(from=initial.date,to = last.date,by = '30 sec')
TradePrice=seq(from=1, to=length(PriorityDateTime),by = 1)
ndf<- data.frame(PriorityDateTime,TradePrice)
ndf$InstrumentSymbol <- rep_len(x = c('asset1','asset2'),length.out = length(ndf$PriorityDateTime))
ndf$id <- seq(1:length(x = ndf$InstrumentSymbol))
ndf$datetime <- ymd_hms(ndf$PriorityDateTime)
res <- ndf %>% data.table()
看起来像这样:
> res
PriorityDateTime TradePrice InstrumentSymbol id datetime
1: 2018-10-27 10:00:00 1 asset1 1 2018-10-27 10:00:00
2: 2018-10-27 10:00:30 2 asset2 2 2018-10-27 10:00:30
3: 2018-10-27 10:01:00 3 asset1 3 2018-10-27 10:01:00
4: 2018-10-27 10:01:30 4 asset2 4 2018-10-27 10:01:30
5: 2018-10-27 10:02:00 5 asset1 5 2018-10-27 10:02:00
使用 data.table
最优雅、最快捷的方式是:
Using data.table
what is the most elegant and fast way to:
- 拆分:为每一行定义在过去或未来最多 60 秒(时间差小于 60 秒)具有
datetime
的其他行,并具有与此行的InstrumentSymbol
相同. - Apply:在这些接近的行中,哪一行的
TradePrice
与该行的TradePrice[i]
最接近:获取index
在原始data.frame
和另一行的TradePrice
- 组合:将结果作为新列重新组合到原始
data.table
中,例如作为新列index.minpricewithin60
和minpricewithin60
李>
- Split: For each line define the other lines that have a
datetime
at most 60 secs in the past or future (time difference less than 60secs), and have the sameInstrumentSymbol
as this line's. - Apply: among these close lines, which one has the closest
TradePrice
to this line'sTradePrice[i]
: get theindex
in the originaldata.frame
and theTradePrice
of this other row - Combine: recombine the results as new columns into the original
data.table
for example as new columnsindex.minpricewithin60
andminpricewithin60
示例结果:
> res
PriorityDateTime TradePrice InstrumentSymbol id datetime minpricewithin60 index.minpricewithin60
1: 2018-10-27 10:00:00 1 asset1 1 2018-10-27 10:00:00 2 2
2: 2018-10-27 10:00:30 2 asset2 2 2018-10-27 10:00:30 4 4
3: 2018-10-27 10:01:00 3 asset1 3 2018-10-27 10:01:00 1 1
4: 2018-10-27 10:01:30 4 asset2 4 2018-10-27 10:01:30 2 2
5: 2018-10-27 10:02:00 5 asset1 5 2018-10-27 10:02:00 3 3
在 base
中,我可以修复一行并将其用于条件.例如,如果我想获得第一个 TradePrice
其中 id
与该行的 id
相同的位置,我可以执行 apply(df,1, 函数(x) df$TradePrice[which(df$id==x["id"])[1]])
.您能否解释一下 data.table
的连接(例如)如何实现相同的效果?
In base
I can fix a row ans use it for conditions. For example if I want to get the first TradePrice
where id
is the same as this row's id
, I can do apply(df,1, function(x) df$TradePrice[which(df$id==x["id"])[1]])
. Could you please explain how data.table
's joins (for example) can achieve the same?
数据现在更大,我可以在不到 2.5 分钟的时间内在我的体面 PC(i7 4750 2B,12GB RAM)上运行的任何答案都将被考虑.干杯.
Data is now larger, any answer that I can run on my decent PC (i7 4750 2B, 12GB RAM) in less than 2.5mins will be considered. Cheers.
推荐答案
OP 没有提到新数据集的大小.但是 Rcpp
解决方案应该可以加快速度.
OP did not mention the size of the new dataset. But a Rcpp
solution should speed things up.
根据之前的评论:
mtd1 <- function() {
ndf[, rn:=.I]
iidx <- ndf[
.(inst=InstrumentSymbol, prevMin=datetime-60L, nextMin=datetime+60L, idx=id, tp=TradePrice),
.SD[id != idx, rn[which.min(abs(TradePrice - tp))]],
by=.EACHI,
on=.(InstrumentSymbol=inst, datetime>=prevMin, datetime<=nextMin)];
ndf[, c("minpricewithin60", "index.minpricewithin60") := .SD[iidx$V1, .(TradePrice, id)]]
}
arg0naut 的做法:
arg0naut's approach:
mtd2 <- function() {
res2[, `:=` (min_60 = datetime - 60, plus_60 = datetime + 60, idx = .I)][
res2, on = .(InstrumentSymbol = InstrumentSymbol, datetime >= min_60, datetime <= plus_60), allow.cartesian = TRUE][
idx != i.idx, .SD[which.min(abs(i.TradePrice - TradePrice))], by = id][
, .(id, minpricewithin60 = i.TradePrice, index.minpricewithin60 = i.idx)][
res, on = .(id)][, `:=` (min_60 = NULL, plus_60 = NULL, idx = NULL)]
}
一种可能的 Rcpp 方法:
A possible Rcpp approach:
library(Rcpp)
cppFunction('
NumericVector nearestPrice(NumericVector id, NumericVector datetime, NumericVector price) {
int i, j, n = id.size();
NumericVector res(n);
double prev, diff;
for (i=0; i<n; i++) {
prev = 100000;
j = i-1;
while (datetime[j] >= datetime[i]-60 && j>=0) {
diff = std::abs(price[i] - price[j]);
if (diff < prev) {
res[i] = id[j];
prev = diff;
}
j--;
}
j = i+1;
while (datetime[j] <= datetime[i]+60 && j<=n) {
diff = std::abs(price[i] - price[j]);
if (diff < prev) {
res[i] = id[j];
prev = diff;
}
j++;
}
}
return(res);
}
')
mtd3 <- function() {
setorder(ndf2, InstrumentSymbol, PriorityDateTime)
iidx <- ndf2[, nearestPrice(.I, datetime, TradePrice), by=.(InstrumentSymbol)]
ndf2[, c("minpricewithin60", "index.minpricewithin60") := .SD[iidx$V1, .(TradePrice, id)]]
}
计时码:
library(microbenchmark)
microbenchmark(mtd1(), mtd2(), mtd3(), times=3L)
时间安排:
Unit: milliseconds
expr min lq mean median uq max neval
mtd1() 49447.09713 49457.12408 49528.14395 49467.15103 49568.66737 49670.18371 3
mtd2() 64189.67241 64343.67138 64656.40058 64497.67034 64889.76466 65281.85899 3
mtd3() 17.33116 19.58716 22.36557 21.84316 24.88277 27.92238 3
数据:
set.seed(0L)
initial.date <- as.POSIXct('2018-01-01 00:00:00', tz='GMT')
last.date <- initial.date + 30 * (180000/2)
PriorityDateTime <- seq.POSIXt(from=initial.date, to=last.date, by='30 sec')
library(data.table)
ndf <- data.table(PriorityDateTime=c(PriorityDateTime, PriorityDateTime),
TradePrice=rnorm(length(PriorityDateTime)*2, 100, 20),
InstrumentSymbol=rep(c('asset1','asset2'), each=length(PriorityDateTime)),
datetime=c(PriorityDateTime, PriorityDateTime))
setorder(ndf, InstrumentSymbol, PriorityDateTime)[, id := .I]
res <- copy(ndf)
res2 <- copy(ndf)
ndf2 <- copy(ndf)
这篇关于R data.table:基于行的条件拆分/应用/组合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!