R data.table:基于行的条件拆分/应用/合并 [英] R data.table: row-based conditions split/apply/combine

查看:60
本文介绍了R data.table:基于行的条件拆分/应用/合并的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下 data.table

initial.date <- as.POSIXct('2018-10-27 10:00:00',tz='GMT')
last.date <- as.POSIXct('2018-12-28 17:00:00',tz='GMT') 
    PriorityDateTime=seq.POSIXt(from=initial.date,to = last.date,by = '30 sec')
    TradePrice=seq(from=1, to=length(PriorityDateTime),by = 1)
    ndf<- data.frame(PriorityDateTime,TradePrice)
    ndf$InstrumentSymbol <- rep_len(x = c('asset1','asset2'),length.out = length(ndf$PriorityDateTime))
    ndf$id <- seq(1:length(x = ndf$InstrumentSymbol))
    ndf$datetime <- ymd_hms(ndf$PriorityDateTime)
    res <- ndf %>% data.table()

看起来像

    > res
         PriorityDateTime TradePrice InstrumentSymbol   id            datetime
   1: 2018-10-27 10:00:00          1           asset1    1 2018-10-27 10:00:00
   2: 2018-10-27 10:00:30          2           asset2    2 2018-10-27 10:00:30
   3: 2018-10-27 10:01:00          3           asset1    3 2018-10-27 10:01:00
   4: 2018-10-27 10:01:30          4           asset2    4 2018-10-27 10:01:30
   5: 2018-10-27 10:02:00          5           asset1    5 2018-10-27 10:02:00

使用 data.table 最优雅,最快捷的方法是:

Using data.table what is the most elegant and fast way to:


  1. 分割:对于每一行,定义其他各行在过去或将来最多具有 datetime (时差小于60sec),并且具有相同的 InstrumentSymbol
  2. 应用:在这些关闭行中,最接近该行 TradePrice > TradePr ice [i] :在原始 data.frame 中获取索引,并在<另一行的code> TradePrice

  3. 合并:将结果重新合并为原始 data.table中的新列作为新列 index.minpricewithin60 minpricewithin60

  1. Split: For each line define the other lines that have a datetime at most 60 secs in the past or future (time difference less than 60secs), and have the same InstrumentSymbol as this line's.
  2. Apply: among these close lines, which one has the closest TradePrice to this line's TradePrice[i]: get the index in the original data.frame and the TradePrice of this other row
  3. Combine: recombine the results as new columns into the original data.table for example as new columns index.minpricewithin60 and minpricewithin60

示例结果:

> res
         PriorityDateTime TradePrice InstrumentSymbol   id            datetime minpricewithin60 index.minpricewithin60
   1: 2018-10-27 10:00:00          1           asset1    1 2018-10-27 10:00:00                2                      2
   2: 2018-10-27 10:00:30          2           asset2    2 2018-10-27 10:00:30                4                      4
   3: 2018-10-27 10:01:00          3           asset1    3 2018-10-27 10:01:00                1                      1
   4: 2018-10-27 10:01:30          4           asset2    4 2018-10-27 10:01:30                2                      2
   5: 2018-10-27 10:02:00          5           asset1    5 2018-10-27 10:02:00                3                      3

base 我可以修复一行并将其用于条件。例如,如果我想获取第一个 TradePrice ,其中 id 与该行的相同id ,我可以执行 apply(df,1,function(x)df $ TradePrice [which(df $ id == x [ id])[1]] )。您能否解释一下 data.table 的联接如何实现相同的功能?

In base I can fix a row ans use it for conditions. For example if I want to get the first TradePrice where id is the same as this row's id, I can do apply(df,1, function(x) df$TradePrice[which(df$id==x["id"])[1]]). Could you please explain how data.table's joins (for example) can achieve the same?

编辑:数据现在更大了,任何可以在不到2.5分钟的时间内在我的体面的PC(i7 4750 2B,12GB RAM)上运行的答案都将得到考虑。欢呼。

Data is now larger, any answer that I can run on my decent PC (i7 4750 2B, 12GB RAM) in less than 2.5mins will be considered. Cheers.

推荐答案

OP没有提到新数据集的大小。但是 Rcpp 解决方案应该可以加快速度。

OP did not mention the size of the new dataset. But a Rcpp solution should speed things up.

根据之前的评论:

mtd1 <- function() {
    ndf[, rn:=.I]
    iidx <- ndf[
        .(inst=InstrumentSymbol, prevMin=datetime-60L, nextMin=datetime+60L, idx=id, tp=TradePrice),

        .SD[id != idx, rn[which.min(abs(TradePrice - tp))]],

        by=.EACHI,

        on=.(InstrumentSymbol=inst, datetime>=prevMin, datetime<=nextMin)];

    ndf[, c("minpricewithin60", "index.minpricewithin60") := .SD[iidx$V1, .(TradePrice, id)]]
}

arg0naut的方法:

arg0naut's approach:

mtd2 <- function() {
    res2[, `:=` (min_60 = datetime - 60, plus_60 = datetime + 60, idx = .I)][
        res2,  on = .(InstrumentSymbol = InstrumentSymbol, datetime >= min_60, datetime <= plus_60), allow.cartesian = TRUE][
            idx != i.idx, .SD[which.min(abs(i.TradePrice - TradePrice))], by = id][
                , .(id, minpricewithin60 = i.TradePrice, index.minpricewithin60 = i.idx)][
                    res, on = .(id)][, `:=` (min_60 = NULL, plus_60 = NULL, idx = NULL)]

}

一种可能的Rcpp方法:

A possible Rcpp approach:

library(Rcpp)
cppFunction('
NumericVector nearestPrice(NumericVector id, NumericVector datetime, NumericVector price) {
    int i, j, n = id.size();
    NumericVector res(n);
    double prev, diff;

    for (i=0; i<n; i++) {
        prev = 100000;

        j = i-1;
        while (datetime[j] >= datetime[i]-60 && j>=0) {
            diff = std::abs(price[i] - price[j]);

            if (diff < prev) {
                res[i] = id[j];
                prev = diff;
            }
            j--;
        }

        j = i+1;
        while (datetime[j] <= datetime[i]+60 && j<=n) {
            diff = std::abs(price[i] - price[j]);

            if (diff < prev) {
                res[i] = id[j];
                prev = diff;
            }
            j++;
        }
    }

    return(res);
}
')

mtd3 <- function() {
    setorder(ndf2, InstrumentSymbol, PriorityDateTime)
    iidx <- ndf2[, nearestPrice(.I, datetime, TradePrice), by=.(InstrumentSymbol)]
    ndf2[, c("minpricewithin60", "index.minpricewithin60") := .SD[iidx$V1, .(TradePrice, id)]]
}

定时代码:

library(microbenchmark)
microbenchmark(mtd1(), mtd2(), mtd3(), times=3L)

定时:

Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 mtd1() 49447.09713 49457.12408 49528.14395 49467.15103 49568.66737 49670.18371     3
 mtd2() 64189.67241 64343.67138 64656.40058 64497.67034 64889.76466 65281.85899     3
 mtd3()    17.33116    19.58716    22.36557    21.84316    24.88277    27.92238     3

数据:

set.seed(0L)
initial.date <- as.POSIXct('2018-01-01 00:00:00', tz='GMT')
last.date <- initial.date + 30 * (180000/2)
PriorityDateTime <- seq.POSIXt(from=initial.date, to=last.date, by='30 sec')

library(data.table)
ndf <- data.table(PriorityDateTime=c(PriorityDateTime, PriorityDateTime),
    TradePrice=rnorm(length(PriorityDateTime)*2, 100, 20),
    InstrumentSymbol=rep(c('asset1','asset2'), each=length(PriorityDateTime)),
    datetime=c(PriorityDateTime, PriorityDateTime))
setorder(ndf, InstrumentSymbol, PriorityDateTime)[, id := .I]
res <- copy(ndf)
res2  <- copy(ndf)
ndf2 <- copy(ndf)

这篇关于R data.table:基于行的条件拆分/应用/合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆