R data.table:基于行的条件拆分/应用/组合 [英] R data.table: row-based conditions split/apply/combine

查看:14
本文介绍了R data.table:基于行的条件拆分/应用/组合的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下 data.table

initial.date <- as.POSIXct('2018-10-27 10:00:00',tz='GMT')
last.date <- as.POSIXct('2018-12-28 17:00:00',tz='GMT') 
    PriorityDateTime=seq.POSIXt(from=initial.date,to = last.date,by = '30 sec')
    TradePrice=seq(from=1, to=length(PriorityDateTime),by = 1)
    ndf<- data.frame(PriorityDateTime,TradePrice)
    ndf$InstrumentSymbol <- rep_len(x = c('asset1','asset2'),length.out = length(ndf$PriorityDateTime))
    ndf$id <- seq(1:length(x = ndf$InstrumentSymbol))
    ndf$datetime <- ymd_hms(ndf$PriorityDateTime)
    res <- ndf %>% data.table()

看起来像这样:

    > res
         PriorityDateTime TradePrice InstrumentSymbol   id            datetime
   1: 2018-10-27 10:00:00          1           asset1    1 2018-10-27 10:00:00
   2: 2018-10-27 10:00:30          2           asset2    2 2018-10-27 10:00:30
   3: 2018-10-27 10:01:00          3           asset1    3 2018-10-27 10:01:00
   4: 2018-10-27 10:01:30          4           asset2    4 2018-10-27 10:01:30
   5: 2018-10-27 10:02:00          5           asset1    5 2018-10-27 10:02:00

使用 data.table 最优雅、最快捷的方式是:

Using data.table what is the most elegant and fast way to:

  1. 拆分:为每一行定义在过去或未来最多 60 秒(时间差小于 60 秒)具有 datetime 的其他行,并具有与此行的 InstrumentSymbol 相同.
  2. Apply:在这些接近的行中,哪一行的 TradePrice 与该行的 TradePrice[i] 最接近:获取 index 在原始 data.frame 和另一行的 TradePrice
  3. 组合:将结果作为新列重新组合到原始 data.table 中,例如作为新列 index.minpricewithin60minpricewithin60
  1. Split: For each line define the other lines that have a datetime at most 60 secs in the past or future (time difference less than 60secs), and have the same InstrumentSymbol as this line's.
  2. Apply: among these close lines, which one has the closest TradePrice to this line's TradePrice[i]: get the index in the original data.frame and the TradePrice of this other row
  3. Combine: recombine the results as new columns into the original data.table for example as new columns index.minpricewithin60 and minpricewithin60

示例结果:

> res
         PriorityDateTime TradePrice InstrumentSymbol   id            datetime minpricewithin60 index.minpricewithin60
   1: 2018-10-27 10:00:00          1           asset1    1 2018-10-27 10:00:00                2                      2
   2: 2018-10-27 10:00:30          2           asset2    2 2018-10-27 10:00:30                4                      4
   3: 2018-10-27 10:01:00          3           asset1    3 2018-10-27 10:01:00                1                      1
   4: 2018-10-27 10:01:30          4           asset2    4 2018-10-27 10:01:30                2                      2
   5: 2018-10-27 10:02:00          5           asset1    5 2018-10-27 10:02:00                3                      3

base 中,我可以修复一行并将其用于条件.例如,如果我想获得第一个 TradePrice 其中 id 与该行的 id 相同的位置,我可以执行 apply(df,1, 函数(x) df$TradePrice[which(df$id==x["id"])[1]]).您能否解释一下 data.table 的连接(例如)如何实现相同的效果?

In base I can fix a row ans use it for conditions. For example if I want to get the first TradePrice where id is the same as this row's id, I can do apply(df,1, function(x) df$TradePrice[which(df$id==x["id"])[1]]). Could you please explain how data.table's joins (for example) can achieve the same?

数据现在更大,我可以在不到 2.5 分钟的时间内在我的体面 PC(i7 4750 2B,12GB RAM)上运行的任何答案都将被考虑.干杯.

Data is now larger, any answer that I can run on my decent PC (i7 4750 2B, 12GB RAM) in less than 2.5mins will be considered. Cheers.

推荐答案

OP 没有提到新数据集的大小.但是 Rcpp 解决方案应该可以加快速度.

OP did not mention the size of the new dataset. But a Rcpp solution should speed things up.

根据之前的评论:

mtd1 <- function() {
    ndf[, rn:=.I]
    iidx <- ndf[
        .(inst=InstrumentSymbol, prevMin=datetime-60L, nextMin=datetime+60L, idx=id, tp=TradePrice),

        .SD[id != idx, rn[which.min(abs(TradePrice - tp))]],

        by=.EACHI,

        on=.(InstrumentSymbol=inst, datetime>=prevMin, datetime<=nextMin)];

    ndf[, c("minpricewithin60", "index.minpricewithin60") := .SD[iidx$V1, .(TradePrice, id)]]
}

arg0naut 的做法:

arg0naut's approach:

mtd2 <- function() {
    res2[, `:=` (min_60 = datetime - 60, plus_60 = datetime + 60, idx = .I)][
        res2,  on = .(InstrumentSymbol = InstrumentSymbol, datetime >= min_60, datetime <= plus_60), allow.cartesian = TRUE][
            idx != i.idx, .SD[which.min(abs(i.TradePrice - TradePrice))], by = id][
                , .(id, minpricewithin60 = i.TradePrice, index.minpricewithin60 = i.idx)][
                    res, on = .(id)][, `:=` (min_60 = NULL, plus_60 = NULL, idx = NULL)]

}

一种可能的 Rcpp 方法:

A possible Rcpp approach:

library(Rcpp)
cppFunction('
NumericVector nearestPrice(NumericVector id, NumericVector datetime, NumericVector price) {
    int i, j, n = id.size();
    NumericVector res(n);
    double prev, diff;

    for (i=0; i<n; i++) {
        prev = 100000;

        j = i-1;
        while (datetime[j] >= datetime[i]-60 && j>=0) {
            diff = std::abs(price[i] - price[j]);

            if (diff < prev) {
                res[i] = id[j];
                prev = diff;
            }
            j--;
        }

        j = i+1;
        while (datetime[j] <= datetime[i]+60 && j<=n) {
            diff = std::abs(price[i] - price[j]);

            if (diff < prev) {
                res[i] = id[j];
                prev = diff;
            }
            j++;
        }
    }

    return(res);
}
')

mtd3 <- function() {
    setorder(ndf2, InstrumentSymbol, PriorityDateTime)
    iidx <- ndf2[, nearestPrice(.I, datetime, TradePrice), by=.(InstrumentSymbol)]
    ndf2[, c("minpricewithin60", "index.minpricewithin60") := .SD[iidx$V1, .(TradePrice, id)]]
}

计时码:

library(microbenchmark)
microbenchmark(mtd1(), mtd2(), mtd3(), times=3L)

时间安排:

Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 mtd1() 49447.09713 49457.12408 49528.14395 49467.15103 49568.66737 49670.18371     3
 mtd2() 64189.67241 64343.67138 64656.40058 64497.67034 64889.76466 65281.85899     3
 mtd3()    17.33116    19.58716    22.36557    21.84316    24.88277    27.92238     3

数据:

set.seed(0L)
initial.date <- as.POSIXct('2018-01-01 00:00:00', tz='GMT')
last.date <- initial.date + 30 * (180000/2)
PriorityDateTime <- seq.POSIXt(from=initial.date, to=last.date, by='30 sec')

library(data.table)
ndf <- data.table(PriorityDateTime=c(PriorityDateTime, PriorityDateTime),
    TradePrice=rnorm(length(PriorityDateTime)*2, 100, 20),
    InstrumentSymbol=rep(c('asset1','asset2'), each=length(PriorityDateTime)),
    datetime=c(PriorityDateTime, PriorityDateTime))
setorder(ndf, InstrumentSymbol, PriorityDateTime)[, id := .I]
res <- copy(ndf)
res2  <- copy(ndf)
ndf2 <- copy(ndf)

这篇关于R data.table:基于行的条件拆分/应用/组合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆