Merging endpoints of a range with a sequence

Question

In one of my applications there is a piece of code that retrieves information from a data.table object depending on values in another.

library(data.table)

# say this table contains customer details
dt <- data.table(id=LETTERS[1:4],
                 start=seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month"),
                 end=seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month") + c(6,8,10,5),
                 key="id")

# this one has some historical details
# (date and var have length 120 and are recycled across the four ids)
dt1 <- data.table(id=rep(LETTERS[1:4], each=120),
                  date=seq(as.Date("2010-01-01"), as.Date("2010-04-30"), "day"),
                  var=rnorm(120),
                  key="id,date")

# and here I finally retrieve my historical information based on the customer details
myfunc <- function(x) {
  # some code
  # every day between this customer's start and end dates
  period <- seq(x$start, x$end, "day")
  # keyed lookup on (id, date), then average var over the period
  dt1[.(x$id, period)][, mean(var)]
  # some code
}

Using adply:

library(plyr)
library(microbenchmark)
> adply(dt, 1, myfunc)
   id      start        end         V1
1:  A 2010-01-01 2010-01-07  0.3143536
2:  B 2010-02-01 2010-02-09 -0.5796084
3:  C 2010-03-01 2010-03-11  0.1171404
4:  D 2010-04-01 2010-04-06  0.2384237

> microbenchmark(adply(dt, 1, myfunc))
Unit: milliseconds
                 expr      min       lq   median       uq      max neval
 adply(dt, 1, myfunc) 8.812486 8.998338 9.105776 9.223637 88.14057   100

Do you know a way to avoid the adply call and do the above in one data.table statement? Or, failing that, a faster method? (Title edit suggestions are more than welcome; I couldn't think of a better one. Thanks.)

Recommended answer

This is a great spot to use the roll argument of data.table:

setkey(dt1, id, date)
setkey(dt, id, start)

dt[dt1, roll = TRUE][end >= start,
   list(start = start[1], end = end[1], result = mean(var)), by = id]
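
To see why the rolling join gives the right answer, it can be checked against a direct per-customer computation. The sketch below rebuilds tables shaped like the ones in the question; the fixed seed, the explicit rep() and the names rolled, res and check are assumptions added for illustration. With roll = TRUE, each row of dt1 picks up the end value of the most recent start on or before its date. In the joined result the start column carries the dt1 dates, so the end >= start filter keeps only the dates that fall inside each customer's range.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the comparison is reproducible

# Tables shaped like those in the question
dt <- data.table(id    = LETTERS[1:4],
                 start = seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month"),
                 end   = seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month") + c(6, 8, 10, 5))
dt1 <- data.table(id   = rep(LETTERS[1:4], each = 120),
                  date = rep(seq(as.Date("2010-01-01"), as.Date("2010-04-30"), "day"), 4),
                  var  = rnorm(480))
setkey(dt1, id, date)
setkey(dt, id, start)

# Rolling join: every dt1 row is matched to the last start <= date for its id.
# Rows dated before a customer's start get NA for end and are dropped by the filter.
rolled <- dt[dt1, roll = TRUE]
res <- rolled[end >= start, list(result = mean(var)), by = id]

# Brute-force reference: filter dt1 once per customer
check <- sapply(seq_len(nrow(dt)), function(k) {
  cid <- dt$id[k]; s <- dt$start[k]; e <- dt$end[k]
  dt1[id == cid & date >= s & date <= e, mean(var)]
})

print(all.equal(res$result, check))
```

If the two approaches agree, all.equal() returns TRUE; a mismatch would suggest the filter or the keys are not set as intended.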

# benchmark
microbenchmark(OP    = adply(dt, 1, myfunc),
               Frank = dt[dt1[as.list(dt[,seq.Date(start,end,"day"),by="id"])][,mean(var),by=id]],
               eddi  = dt[dt1, roll = TRUE][end >= start,list(start = start[1], end = end[1], result = mean(var)), by = id])
#Unit: milliseconds
#  expr       min        lq    median        uq       max neval
#    OP 24.436126 29.184786 30.853094 32.493521 50.898664   100
# Frank  9.115676 11.303691 12.081000 13.122753 28.370415   100
#  eddi  5.336315  6.323643  6.771898  7.497285  9.531376   100

The time difference will become much more dramatic as the size of the datasets grows.
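
As an aside that is not part of the original answer: data.table 1.9.8 and later also support non-equi joins, which let the range condition be written directly in the on argument. The following is only a sketch under the same assumed data as the question (the seed and the names res_ne and check are illustrative), and it has not been benchmarked against the timings above.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed for reproducibility

dt <- data.table(id    = LETTERS[1:4],
                 start = seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month"),
                 end   = seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month") + c(6, 8, 10, 5))
dt1 <- data.table(id   = rep(LETTERS[1:4], each = 120),
                  date = rep(seq(as.Date("2010-01-01"), as.Date("2010-04-30"), "day"), 4),
                  var  = rnorm(480))

# Non-equi join: for each row of dt, match the dt1 rows with the same id
# whose date lies in [start, end], and average var over those matches.
res_ne <- dt1[dt, on = .(id, date >= start, date <= end),
              .(result = mean(var)), by = .EACHI]

# Brute-force reference for comparison
check <- sapply(seq_len(nrow(dt)), function(k) {
  cid <- dt$id[k]; s <- dt$start[k]; e <- dt$end[k]
  dt1[id == cid & date >= s & date <= e, mean(var)]
})

print(all.equal(res_ne$result, check))
```

Because by = .EACHI evaluates j once per row of dt, the result has one row per customer, in the order of dt.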
