R data.table:基于行的条件拆分/应用/合并 [英] R data.table: row-based conditions split/apply/combine
问题描述
我有以下 data.table
initial.date <- as.POSIXct('2018-10-27 10:00:00',tz='GMT')
last.date <- as.POSIXct('2018-12-28 17:00:00',tz='GMT')
PriorityDateTime=seq.POSIXt(from=initial.date,to = last.date,by = '30 sec')
TradePrice=seq(from=1, to=length(PriorityDateTime),by = 1)
ndf<- data.frame(PriorityDateTime,TradePrice)
ndf$InstrumentSymbol <- rep_len(x = c('asset1','asset2'),length.out = length(ndf$PriorityDateTime))
ndf$id <- seq(1:length(x = ndf$InstrumentSymbol))
ndf$datetime <- ymd_hms(ndf$PriorityDateTime)
res <- ndf %>% data.table()
看起来像
> res
PriorityDateTime TradePrice InstrumentSymbol id datetime
1: 2018-10-27 10:00:00 1 asset1 1 2018-10-27 10:00:00
2: 2018-10-27 10:00:30 2 asset2 2 2018-10-27 10:00:30
3: 2018-10-27 10:01:00 3 asset1 3 2018-10-27 10:01:00
4: 2018-10-27 10:01:30 4 asset2 4 2018-10-27 10:01:30
5: 2018-10-27 10:02:00 5 asset1 5 2018-10-27 10:02:00
使用 data.table
最优雅,最快捷的方法是:
Using data.table
what is the most elegant and fast way to:
- 分割:对于每一行,定义其他各行在过去或将来最多具有
datetime
(时差小于60sec),并且具有相同的InstrumentSymbol
- 应用:在这些关闭行中,最接近该行
TradePrice
> TradePr ice [i] :在原始data.frame
中获取索引
,并在<另一行的code> TradePrice - 合并:将结果重新合并为原始
data.table中的新列例如,将code>作为新列
index.minpricewithin60
和minpricewithin60
- Split: For each line define the other lines that have a
datetime
at most 60 secs in the past or future (time difference less than 60secs), and have the sameInstrumentSymbol
as this line's. - Apply: among these close lines, which one has the closest
TradePrice
to this line'sTradePrice[i]
: get theindex
in the originaldata.frame
and theTradePrice
of this other row - Combine: recombine the results as new columns into the original
data.table
for example as new columnsindex.minpricewithin60
andminpricewithin60
示例结果:
> res
PriorityDateTime TradePrice InstrumentSymbol id datetime minpricewithin60 index.minpricewithin60
1: 2018-10-27 10:00:00 1 asset1 1 2018-10-27 10:00:00 2 2
2: 2018-10-27 10:00:30 2 asset2 2 2018-10-27 10:00:30 4 4
3: 2018-10-27 10:01:00 3 asset1 3 2018-10-27 10:01:00 1 1
4: 2018-10-27 10:01:30 4 asset2 4 2018-10-27 10:01:30 2 2
5: 2018-10-27 10:02:00 5 asset1 5 2018-10-27 10:02:00 3 3
在 base
我可以修复一行并将其用于条件。例如,如果我想获取第一个 TradePrice
,其中 id
与该行的相同id
,我可以执行 apply(df,1,function(x)df $ TradePrice [which(df $ id == x [ id])[1]] )
。您能否解释一下 data.table
的联接如何实现相同的功能?
In base
I can fix a row ans use it for conditions. For example if I want to get the first TradePrice
where id
is the same as this row's id
, I can do apply(df,1, function(x) df$TradePrice[which(df$id==x["id"])[1]])
. Could you please explain how data.table
's joins (for example) can achieve the same?
编辑:数据现在更大了,任何可以在不到2.5分钟的时间内在我的体面的PC(i7 4750 2B,12GB RAM)上运行的答案都将得到考虑。欢呼。
Data is now larger, any answer that I can run on my decent PC (i7 4750 2B, 12GB RAM) in less than 2.5mins will be considered. Cheers.
推荐答案
OP没有提到新数据集的大小。但是 Rcpp
解决方案应该可以加快速度。
OP did not mention the size of the new dataset. But a Rcpp
solution should speed things up.
根据之前的评论:
mtd1 <- function() {
ndf[, rn:=.I]
iidx <- ndf[
.(inst=InstrumentSymbol, prevMin=datetime-60L, nextMin=datetime+60L, idx=id, tp=TradePrice),
.SD[id != idx, rn[which.min(abs(TradePrice - tp))]],
by=.EACHI,
on=.(InstrumentSymbol=inst, datetime>=prevMin, datetime<=nextMin)];
ndf[, c("minpricewithin60", "index.minpricewithin60") := .SD[iidx$V1, .(TradePrice, id)]]
}
arg0naut的方法:
arg0naut's approach:
mtd2 <- function() {
res2[, `:=` (min_60 = datetime - 60, plus_60 = datetime + 60, idx = .I)][
res2, on = .(InstrumentSymbol = InstrumentSymbol, datetime >= min_60, datetime <= plus_60), allow.cartesian = TRUE][
idx != i.idx, .SD[which.min(abs(i.TradePrice - TradePrice))], by = id][
, .(id, minpricewithin60 = i.TradePrice, index.minpricewithin60 = i.idx)][
res, on = .(id)][, `:=` (min_60 = NULL, plus_60 = NULL, idx = NULL)]
}
一种可能的Rcpp方法:
A possible Rcpp approach:
library(Rcpp)
cppFunction('
NumericVector nearestPrice(NumericVector id, NumericVector datetime, NumericVector price) {
int i, j, n = id.size();
NumericVector res(n);
double prev, diff;
for (i=0; i<n; i++) {
prev = 100000;
j = i-1;
while (datetime[j] >= datetime[i]-60 && j>=0) {
diff = std::abs(price[i] - price[j]);
if (diff < prev) {
res[i] = id[j];
prev = diff;
}
j--;
}
j = i+1;
while (datetime[j] <= datetime[i]+60 && j<=n) {
diff = std::abs(price[i] - price[j]);
if (diff < prev) {
res[i] = id[j];
prev = diff;
}
j++;
}
}
return(res);
}
')
mtd3 <- function() {
setorder(ndf2, InstrumentSymbol, PriorityDateTime)
iidx <- ndf2[, nearestPrice(.I, datetime, TradePrice), by=.(InstrumentSymbol)]
ndf2[, c("minpricewithin60", "index.minpricewithin60") := .SD[iidx$V1, .(TradePrice, id)]]
}
定时代码:
library(microbenchmark)
microbenchmark(mtd1(), mtd2(), mtd3(), times=3L)
定时:
Unit: milliseconds
expr min lq mean median uq max neval
mtd1() 49447.09713 49457.12408 49528.14395 49467.15103 49568.66737 49670.18371 3
mtd2() 64189.67241 64343.67138 64656.40058 64497.67034 64889.76466 65281.85899 3
mtd3() 17.33116 19.58716 22.36557 21.84316 24.88277 27.92238 3
数据:
set.seed(0L)
initial.date <- as.POSIXct('2018-01-01 00:00:00', tz='GMT')
last.date <- initial.date + 30 * (180000/2)
PriorityDateTime <- seq.POSIXt(from=initial.date, to=last.date, by='30 sec')
library(data.table)
ndf <- data.table(PriorityDateTime=c(PriorityDateTime, PriorityDateTime),
TradePrice=rnorm(length(PriorityDateTime)*2, 100, 20),
InstrumentSymbol=rep(c('asset1','asset2'), each=length(PriorityDateTime)),
datetime=c(PriorityDateTime, PriorityDateTime))
setorder(ndf, InstrumentSymbol, PriorityDateTime)[, id := .I]
res <- copy(ndf)
res2 <- copy(ndf)
ndf2 <- copy(ndf)
这篇关于R data.table:基于行的条件拆分/应用/合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!