R: using data.table := operations to calculate new columns

Problem description

Let's take the following data:

library(data.table)

dt <- data.table(TICKER=c(rep("ABC",10),"DEF"),
        PERIOD=c(rep(as.Date("2010-12-31"),10),as.Date("2011-12-31")),
        DATE=as.Date(c("2010-01-05","2010-01-07","2010-01-08","2010-01-09","2010-01-10","2010-01-11","2010-01-13","2010-04-01","2010-04-02","2010-08-03","2011-02-05")),
        ID=c(1,2,1,3,1,2,1,1,2,2,1),VALUE=c(1.5,1.3,1.4,1.6,1.4,1.2,1.5,1.7,1.8,1.7,2.3))
setkey(dt,TICKER,PERIOD,ID,DATE)

Now for each ticker/period combination, I need the following in a new column:

  • PRIORAVG: The mean of the latest VALUE of each ID, excluding the current ID, provided that value is no more than 180 days old.
  • PREV: The previous value from the same ID.

The result should look like this:

      TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV
 [1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA
 [2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5
 [3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4
 [4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4
 [5,]    ABC 2010-12-31 2010-04-01  1   1.7     1.40  1.5
 [6,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA
 [7,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3
 [8,]    ABC 2010-12-31 2010-04-02  2   1.8     1.65  1.2
 [9,]    ABC 2010-12-31 2010-08-03  2   1.7     1.70  1.8
[10,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA
[11,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

Note that the PRIORAVG on row 9 is 1.7, which equals the VALUE on row 5: that is the only prior observation by another ID within the past 180 days.

I have discovered the data.table package, but I can't seem to fully understand the := function. When I keep it simple, it seems to work. To obtain the previous value for each ID (I based this on the solution to this question):

dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),roll=TRUE,mult="last"][,VALUE]]

This works great, and it only takes 0.13 seconds to perform this operation over my dataset with ~250k rows; my vector scan function gets identical results but is about 30,000 times slower.
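As a side note, a grouped lag is another way to get the previous value per ID. The sketch below is not from the original post; it assumes a data.table version that provides shift() (1.9.6 or later) and relies on the key keeping rows in DATE order within each TICKER/PERIOD/ID group.

```r
library(data.table)

dt <- data.table(
  TICKER = c(rep("ABC", 10), "DEF"),
  PERIOD = c(rep(as.Date("2010-12-31"), 10), as.Date("2011-12-31")),
  DATE   = as.Date(c("2010-01-05", "2010-01-07", "2010-01-08", "2010-01-09",
                     "2010-01-10", "2010-01-11", "2010-01-13", "2010-04-01",
                     "2010-04-02", "2010-08-03", "2011-02-05")),
  ID     = c(1, 2, 1, 3, 1, 2, 1, 1, 2, 2, 1),
  VALUE  = c(1.5, 1.3, 1.4, 1.6, 1.4, 1.2, 1.5, 1.7, 1.8, 1.7, 2.3)
)
setkey(dt, TICKER, PERIOD, ID, DATE)

# Rows are sorted by DATE inside each TICKER/PERIOD/ID group (because of the key),
# so lagging VALUE by one row yields the previous value for the same ID.
# The first row of each group has no predecessor and gets NA.
dt[, PREV := shift(VALUE), by = .(TICKER, PERIOD, ID)]
print(dt)
```

On the sample data this reproduces the PREV column of the expected result shown earlier. Unlike the rolling join on DATE-1, it would treat two rows with the same DATE and ID as consecutive observations, so the two approaches can differ if such duplicates exist.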

OK, so that covers my first requirement. Now for the second, more complex one. The fastest method I have found so far uses a couple of vector scans and passes the function through plyr's adply to get the result for each row.

calc <- function(df,ticker,period,id,date) {
  df <- df[df$TICKER == ticker & df$PERIOD == period 
        & df$ID != id & df$DATE < date & df$DATE > date-180, ]
  df <- df[order(df$DATE),]
  mean(df[!duplicated(df$ID, fromLast = TRUE),"VALUE"])
}

library(plyr)  # provides adply

df <- data.frame(dt)
adply(df,1,function(x) calc(df,x$TICKER,x$PERIOD,x$ID,x$DATE))

I wrote the function for a data.frame and it does not seem to work with a data.table. For a subset of 5000 rows this takes about 44 seconds but my data consists of > 1 million rows. I wonder if this can be made more efficient through the usage of :=.

dt[J("ABC"),last(VALUE),by=ID][,mean(V1)]

This works to select the average of the latest VALUEs for each ID for ABC.

dt[,PRIORAVG:=dt[J(TICKER,PERIOD),last(VALUE),by=ID][,mean(V1)]]

This, however, does not work as expected, as it takes the average of all last VALUEs for all ticker/periods instead of only for the current ticker/period. So it ends up with all rows getting the same mean value. Am I doing something wrong or is this a limitation of := ?
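To illustrate the scoping issue: as far as I can tell, the inner dt[...][, mean(V1)] expression is evaluated once over the whole table and returns a single number, which := then recycles to every row. The sketch below (not from the original post) shows one way to scope a "mean of latest VALUE per ID" to each ticker/period using by and an update join. It deliberately ignores the "exclude current ID" and 180-day rules, and the names last_per_id, grp_avg, LAST and LASTAVG are made up for this example.

```r
library(data.table)

dt <- data.table(
  TICKER = c(rep("ABC", 10), "DEF"),
  PERIOD = c(rep(as.Date("2010-12-31"), 10), as.Date("2011-12-31")),
  DATE   = as.Date(c("2010-01-05", "2010-01-07", "2010-01-08", "2010-01-09",
                     "2010-01-10", "2010-01-11", "2010-01-13", "2010-04-01",
                     "2010-04-02", "2010-08-03", "2011-02-05")),
  ID     = c(1, 2, 1, 3, 1, 2, 1, 1, 2, 2, 1),
  VALUE  = c(1.5, 1.3, 1.4, 1.6, 1.4, 1.2, 1.5, 1.7, 1.8, 1.7, 2.3)
)
setkey(dt, TICKER, PERIOD, ID, DATE)

# Step 1: latest VALUE for each ID within each ticker/period
# (rows are in DATE order within each group because of the key).
last_per_id <- dt[, .(LAST = last(VALUE)), by = .(TICKER, PERIOD, ID)]

# Step 2: average those latest values per ticker/period.
grp_avg <- last_per_id[, .(LASTAVG = mean(LAST)), by = .(TICKER, PERIOD)]

# Step 3: update join, so each row receives the average of its own group only.
dt[grp_avg, LASTAVG := i.LASTAVG, on = .(TICKER, PERIOD)]
print(dt)
```

Here the ABC rows and the DEF row end up with different averages, because the grouping is stated explicitly with by rather than implied by the J() lookup.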

Recommended answer

Good question. Try this:

dt
     TICKER     PERIOD       DATE ID VALUE
[1,]    ABC 2010-12-31 2010-01-05  1   1.5
[2,]    ABC 2010-12-31 2010-01-08  1   1.4
[3,]    ABC 2010-12-31 2010-01-10  1   1.4
[4,]    ABC 2010-12-31 2010-01-13  1   1.5
[5,]    ABC 2010-12-31 2010-01-07  2   1.3
[6,]    ABC 2010-12-31 2010-01-11  2   1.2
[7,]    ABC 2010-12-31 2010-01-09  3   1.6
[8,]    DEF 2011-12-31 2011-02-05  1   2.3

ids = unique(dt$ID)
dt[,PRIORAVG:=NA_real_]
for (i in 1:nrow(dt))
    dt[i,PRIORAVG:=dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),
                      mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"]]
dt
     TICKER     PERIOD       DATE ID VALUE PRIORAVG
[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA
[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30
[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45
[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40
[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50
[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50
[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35
[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA

Then what you had already with a slight simplification ...

dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),VALUE,roll=TRUE,mult="last"]]

     TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV
[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA
[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5
[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4
[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4
[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA
[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3
[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA
[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

If this is OK as a prototype, then a large speed improvement would be to keep the loop but use set() instead of :=, to reduce overhead:

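# set() assigns by reference without the overhead of [.data.table;
# 6L is the position of the PRIORAVG column
for (i in seq_len(nrow(dt)))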
    set(dt,i,6L,dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),
                   mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"])
dt
     TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV
[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA
[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5
[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4
[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4
[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA
[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3
[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA
[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

That should be a lot faster than the repeated vector scans shown in the question.

Or, the operation could be vectorized. But that would be less easy to write and read due to the features of this task.
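For what it is worth, here is a rough sketch of what a vectorized version might look like. It is not the answerer's code: it assumes a data.table version with non-equi joins (1.9.8 or later; nomatch = NULL needs a more recent release), it applies the same strict bounds as the calc() function in the question (earlier than the current DATE, and less than 180 days old), and the names rid, cand, latest, avg, cur_id, cur_date and obs_date are invented for the example.

```r
library(data.table)

dt <- data.table(
  TICKER = c(rep("ABC", 10), "DEF"),
  PERIOD = c(rep(as.Date("2010-12-31"), 10), as.Date("2011-12-31")),
  DATE   = as.Date(c("2010-01-05", "2010-01-07", "2010-01-08", "2010-01-09",
                     "2010-01-10", "2010-01-11", "2010-01-13", "2010-04-01",
                     "2010-04-02", "2010-08-03", "2011-02-05")),
  ID     = c(1, 2, 1, 3, 1, 2, 1, 1, 2, 2, 1),
  VALUE  = c(1.5, 1.3, 1.4, 1.6, 1.4, 1.2, 1.5, 1.7, 1.8, 1.7, 2.3)
)
setkey(dt, TICKER, PERIOD, ID, DATE)

dt[, rid := .I]  # hypothetical helper column identifying each target row

# Non-equi self-join: pair every row (i.*) with all strictly earlier rows (x.*)
# from the same TICKER/PERIOD. Rows with no earlier observation are dropped.
cand <- dt[dt, on = .(TICKER, PERIOD, DATE < DATE),
           .(rid = i.rid, cur_id = i.ID, cur_date = i.DATE,
             ID = x.ID, obs_date = x.DATE, VALUE = x.VALUE),
           nomatch = NULL, allow.cartesian = TRUE]

# Keep observations from other IDs that are less than 180 days old.
cand <- cand[ID != cur_id & obs_date > cur_date - 180]

# Latest observation per (target row, other ID), then the mean per target row.
setorder(cand, rid, ID, obs_date)
latest <- cand[, .(VALUE = last(VALUE)), by = .(rid, ID)]
avg    <- latest[, .(PRIORAVG = mean(VALUE)), by = rid]

# Update join: target rows without any qualifying observation stay NA.
dt[avg, PRIORAVG := i.PRIORAVG, on = "rid"]
dt[, rid := NULL]
print(dt)
```

On the 11-row sample data from the question this reproduces the expected PRIORAVG column, including 1.70 for the 2010-08-03 row. The trade-off is memory: the intermediate cand table holds one row per (row, earlier row) pair within a ticker/period, so for groups with many observations it can grow much larger than dt itself.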

Btw, there isn't any data in the question that would test the 180 day requirement. If you add some and show expected output again then I'll add the calculation of age using join inherited scope I mentioned in comments.
