R: using data.table := operations to calculate new columns


Question

Let's take the following data:

library(data.table)

dt <- data.table(TICKER=c(rep("ABC",10),"DEF"),
        PERIOD=c(rep(as.Date("2010-12-31"),10),as.Date("2011-12-31")),
        DATE=as.Date(c("2010-01-05","2010-01-07","2010-01-08","2010-01-09","2010-01-10","2010-01-11","2010-01-13","2010-04-01","2010-04-02","2010-08-03","2011-02-05")),
        ID=c(1,2,1,3,1,2,1,1,2,2,1),VALUE=c(1.5,1.3,1.4,1.6,1.4,1.2,1.5,1.7,1.8,1.7,2.3))
# Key (and sort) by ticker, period, ID, then date
setkey(dt,TICKER,PERIOD,ID,DATE)



Now for each ticker/period combination, I need the following in a new column:


  • PRIORAVG: The mean of the latest VALUE of each ID, excluding the current ID, provided it is no more than 180 days old.
  • PREV: The previous value from the same ID.

The result should look like this:

      TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV
 [1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA
 [2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5
 [3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4
 [4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4
 [5,]    ABC 2010-12-31 2010-04-01  1   1.7     1.40  1.5
 [6,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA
 [7,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3
 [8,]    ABC 2010-12-31 2010-04-02  2   1.8     1.65  1.2
 [9,]    ABC 2010-12-31 2010-08-03  2   1.7     1.70  1.8
[10,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA
[11,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

Note the PRIORAVG on row 9 is equal to 1.7 (which is equal to the VALUE on row 5, which is the only prior observation in the past 180 days by another ID)

I have discovered the data.table package, but I can't seem to fully understand the := function. When I keep it simple, it seems to work. To obtain the previous value for each ID (I based this on the solution to this question):

dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),roll=TRUE,mult="last"][,VALUE]]

This works great, and it only takes 0.13 seconds to perform this operation over my dataset with ~250k rows; my vector scan function gets identical results but is about 30,000 times slower.

OK, so I've got my first requirement. Let's get to the second, more complex requirement. Right now the fastest method for me is to use a couple of vector scans and pass the function through the plyr function adply to get the result for each row.

calc <- function(df,ticker,period,id,date) {
  df <- df[df$TICKER == ticker & df$PERIOD == period 
        & df$ID != id & df$DATE < date & df$DATE > date-180, ]
  df <- df[order(df$DATE),]
  mean(df[!duplicated(df$ID, fromLast = TRUE),"VALUE"])
}

library(plyr)  # provides adply
df <- data.frame(dt)
adply(df,1,function(x) calc(df,x$TICKER,x$PERIOD,x$ID,x$DATE))

I wrote the function for a data.frame and it does not seem to work with a data.table. For a subset of 5000 rows this takes about 44 seconds but my data consists of > 1 million rows. I wonder if this can be made more efficient through the usage of :=.

dt[J("ABC"),last(VALUE),by=ID][,mean(V1)]

This works to select the average of the latest VALUEs for each ID for ABC.

dt[,PRIORAVG:=dt[J(TICKER,PERIOD),last(VALUE),by=ID][,mean(V1)]]

This, however, does not work as expected, as it takes the average of all last VALUEs for all ticker/periods instead of only for the current ticker/period. So it ends up with all rows getting the same mean value. Am I doing something wrong or is this a limitation of := ?

Recommended answer

dt
     TICKER     PERIOD       DATE ID VALUE
[1,]    ABC 2010-12-31 2010-01-05  1   1.5
[2,]    ABC 2010-12-31 2010-01-08  1   1.4
[3,]    ABC 2010-12-31 2010-01-10  1   1.4
[4,]    ABC 2010-12-31 2010-01-13  1   1.5
[5,]    ABC 2010-12-31 2010-01-07  2   1.3
[6,]    ABC 2010-12-31 2010-01-11  2   1.2
[7,]    ABC 2010-12-31 2010-01-09  3   1.6
[8,]    DEF 2011-12-31 2011-02-05  1   2.3

ids = unique(dt$ID)
dt[,PRIORAVG:=NA_real_]
for (i in 1:nrow(dt))
    dt[i,PRIORAVG:=dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),
                      mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"]]
dt
     TICKER     PERIOD       DATE ID VALUE PRIORAVG
[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA
[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30
[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45
[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40
[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50
[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50
[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35
[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA

Then what you had already with a slight simplification ...

dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),VALUE,roll=TRUE,mult="last"]]

     TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV
[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA
[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5
[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4
[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4
[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA
[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3
[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA
[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA
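
As a hedged alternative for PREV: recent versions of data.table provide shift(), which returns the previous element of a vector within each group. The sketch below assumes the rows are sorted by DATE within each TICKER/PERIOD/ID group (setkey does that here) and that no ID has two observations on the same date. The column name PREV_SHIFT is made up for illustration, and the sample data from the question is rebuilt so the block runs on its own.

```r
library(data.table)

# Sample data copied from the question
dt <- data.table(TICKER=c(rep("ABC",10),"DEF"),
        PERIOD=c(rep(as.Date("2010-12-31"),10),as.Date("2011-12-31")),
        DATE=as.Date(c("2010-01-05","2010-01-07","2010-01-08","2010-01-09",
                       "2010-01-10","2010-01-11","2010-01-13","2010-04-01",
                       "2010-04-02","2010-08-03","2011-02-05")),
        ID=c(1,2,1,3,1,2,1,1,2,2,1),
        VALUE=c(1.5,1.3,1.4,1.6,1.4,1.2,1.5,1.7,1.8,1.7,2.3))
setkey(dt, TICKER, PERIOD, ID, DATE)

# Previous VALUE of the same ID within each ticker/period;
# the first observation of each group gets NA.
dt[, PREV_SHIFT := shift(VALUE), by = .(TICKER, PERIOD, ID)]
```

Unlike the rolling join, this relies purely on row order inside each group, so it is only equivalent when the table is keyed as above.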

If this is OK as a prototype, then a large speed improvement would be to keep the loop but use set() instead of :=, to reduce overhead:

for (i in 1:nrow(dt))
    set(dt,i,6L,dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),
                   mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"])
dt
     TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV
[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA
[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5
[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4
[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4
[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA
[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3
[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA
[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

That should be a lot faster than the repeated vector scans shown in the question.

Or, the operation could be vectorized. But that would be less easy to write and read due to the features of this task.

Btw, there isn't any data in the question that would test the 180 day requirement. If you add some and show expected output again then I'll add the calculation of age using join inherited scope I mentioned in comments.
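
One possible way to handle the 180-day requirement, sketched here as an assumption rather than as the answer's own method: the roll argument of a data.table join also accepts a finite number, which limits how far back a value may be carried forward. The sketch joins on DATE - 1, so only strictly earlier observations count, and uses roll = 179, so the window reaches back about 180 days from the current row. Whether the boundary day itself is included should be checked against the data.table documentation. The loop variable r and the NaN-to-NA conversion are choices made for this illustration.

```r
library(data.table)

# Sample data copied from the question (includes rows that exercise the window)
dt <- data.table(TICKER=c(rep("ABC",10),"DEF"),
        PERIOD=c(rep(as.Date("2010-12-31"),10),as.Date("2011-12-31")),
        DATE=as.Date(c("2010-01-05","2010-01-07","2010-01-08","2010-01-09",
                       "2010-01-10","2010-01-11","2010-01-13","2010-04-01",
                       "2010-04-02","2010-08-03","2011-02-05")),
        ID=c(1,2,1,3,1,2,1,1,2,2,1),
        VALUE=c(1.5,1.3,1.4,1.6,1.4,1.2,1.5,1.7,1.8,1.7,2.3))
setkey(dt, TICKER, PERIOD, ID, DATE)

ids <- unique(dt$ID)
dt[, PRIORAVG := NA_real_]

for (r in seq_len(nrow(dt))) {
  # One lookup row per other ID: roll back from the day before the current
  # DATE to that ID's latest observation, but no further than 179 days.
  avg <- dt[J(TICKER[r], PERIOD[r], setdiff(ids, ID[r]), DATE[r] - 1),
            mean(VALUE, na.rm = TRUE), roll = 179]
  # mean() of nothing is NaN; store NA instead
  if (is.nan(avg)) avg <- NA_real_
  set(dt, r, "PRIORAVG", avg)
}
```

On the sample data this is intended to reproduce the PRIORAVG column of the expected output in the question, including the row dated 2010-08-03, where the ID 3 observation falls outside the window and drops out of the mean.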
