R data.table grouping for lagged regression


Problem description

I have a table with data (it's a data.table object) that looks like the following:

      date         stock_id logret
   1: 2011-01-01        1  0.001
   2: 2011-01-02        1  0.003
   3: 2011-01-03        1  0.005
   4: 2011-01-04        1  0.007
   5: 2011-01-05        1  0.009
   6: 2011-01-06        1  0.011
   7: 2011-01-01        2  0.013
   8: 2011-01-02        2  0.015
   9: 2011-01-03        2  0.017
  10: 2011-01-04        2  0.019
  11: 2011-01-05        2  0.021
  12: 2011-01-06        2  0.023
  13: 2011-01-01        3  0.025
  14: 2011-01-02        3  0.027
  15: 2011-01-03        3  0.029
  16: 2011-01-04        3  0.031
  17: 2011-01-05        3  0.033
  18: 2011-01-06        3  0.035

The table above can be created with:

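library(data.table)

DT = data.table(
  date     = rep(as.Date('2011-01-01') + 0:5, 3),
  stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
  logret   = seq(0.001, by=0.002, length.out=18))

# key by stock, then date, so rows are sorted by date within each stock
setkeyv(DT, c('stock_id','date'))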

Of course the real table is larger, with many more stock_ids and dates. The aim is to reshape this data table so that I can regress each stock's log return on its own log return lagged by 1 day (or by the prior traded day in the case of weekends).

The end result should look like this:

      date         stock_id logret lagret
   1: 2011-01-01        1  0.001    NA
   2: 2011-01-02        1  0.003    0.001
   3: 2011-01-03        1  0.005    0.003
   ....
  16: 2011-01-04        3  0.031  0.029
  17: 2011-01-05        3  0.033  0.031
  18: 2011-01-06        3  0.035  0.033

I'm finding this data structure really tricky to build without mixing up the stock_ids.

Recommended answer

Just some additional notes in response to Alex's comment. The reason it is hard to understand what's going on here is that a lot of things are done within one line, so it's a good idea to break things down.

What do we actually want? We want a new column lagret, and the syntax to add a new column in data.table is the following:

DT[, lagret := xxx]

where xxx has to be replaced with whatever you want to have in column lagret. So if we just wanted a new column that gives us the row numbers, we could call

DT[, lagret := seq(from=1, to=nrow(DT))]
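
To make the effect of := concrete, here is a minimal sketch on a made-up table (toy and row_nr are hypothetical names, not part of the question's data). It only illustrates that := adds the column in place, so no reassignment is needed:

```r
library(data.table)

# Hypothetical toy table, unrelated to the stock data in the question
toy <- data.table(x = c(10, 20, 30))

# := modifies `toy` by reference; .N is the number of rows
toy[, row_nr := seq_len(.N)]

print(toy)
```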

Here, we actually want the lagged value of logret, but we have to take into account that there are many stocks in the table. That's why we do a self-join, i.e. we join the data.table DT with itself on the columns stock_id and date, but since we want the previous value of each stock, we look up date-1. Note that we have to set the keys first to do such a join:

setkeyv(DT,c('stock_id','date'))
DT[list(stock_id,date-1)]
    stock_id       date logret
 1:        1 2010-12-31     NA
 2:        1 2011-01-01  0.001
 3:        1 2011-01-02  0.003
 4:        1 2011-01-03  0.005
 5:        1 2011-01-04  0.007
 6:        1 2011-01-05  0.009
...

As you can see, we now have what we want: logret is lagged by one period. But we actually want that in a new column lagret in DT, so we select logret in the join, extract that column with [[3L]] (which means nothing more than "give me the third column") and assign it to the new column lagret. Note that [[3L]] relies on the join returning a table with the two join columns followed by logret; in current data.table versions (1.9.4 and later, as far as I know) the same join returns logret as a plain vector, so there the [[3L]] has to be dropped:

DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
          date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-04        1  0.007  0.005
 5: 2011-01-05        1  0.009  0.007
...
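
For reference, a sketch of the same self-join written for a current data.table, under the assumption that a join with a single column in j returns a plain vector there. The helper table lookup and the explicit on= argument are my additions for readability, not part of the code above:

```r
library(data.table)

DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, 3),
  stock_id = rep(c(1, 2, 3), each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 18)
)
setkeyv(DT, c("stock_id", "date"))

# One lookup row per row of DT: same stock, one calendar day earlier
lookup <- DT[, list(stock_id, date = date - 1)]

# The join returns logret in the order of `lookup`; unmatched rows give NA
DT[, lagret := DT[lookup, logret, on = c("stock_id", "date")]]

print(head(DT, 8))
```

The first date of every stock has no predecessor, so its lagret is NA and values never leak from one stock into the next.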

This is already the correct solution. In this simple case we do not need roll=TRUE because there are no gaps in the dates. In a more realistic example, however (as mentioned above, for instance when there are weekends), there might be gaps. So let's build such an example by deleting two days for the first stock in DT:

DT <- DT[-c(4, 5)]
setkeyv(DT,c('stock_id','date'))
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
          date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-06        1  0.011     NA
 5: 2011-01-01        2  0.013     NA
...

As you can see, the problem is now that we don't get a lagged value for the 6th of January, because the 5th of January is missing for the first stock. That's why we use roll=TRUE:

DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
          date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-06        1  0.011  0.005
 5: 2011-01-01        2  0.013     NA
...
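
The rolling version can be sketched for a current data.table in the same way. As before, lookup and the on= argument are my own additions, and I am assuming the join returns a vector, so no [[3L]] is used:

```r
library(data.table)

DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, 3),
  stock_id = rep(c(1, 2, 3), each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 18)
)
DT <- DT[-c(4, 5)]                      # drop 4 and 5 January for stock 1
setkeyv(DT, c("stock_id", "date"))

# Look up the previous calendar day for each row
lookup <- DT[, list(stock_id, date = date - 1)]

# roll = TRUE falls back to the last earlier date of the same stock
DT[, lagret := DT[lookup, logret, on = c("stock_id", "date"), roll = TRUE]]

print(head(DT, 6))
```

The roll only applies to the last join column (date) and stays within the same stock_id, so the first date of each stock still gets NA.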

Have a look at the documentation to see exactly how roll=TRUE works. In a nutshell: if it can't find the previous value (here logret for the 5th of January), it takes the last available one (here the one from the 3rd of January).
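
As an aside, and not what the answer above uses: if your data.table version provides shift() (I believe 1.9.6 and later), taking the previous row within each stock gives the same "prior traded day" lag without a join. This is a sketch of that alternative technique; the final lm() call is only an assumed illustration of the regression the question is aiming at:

```r
library(data.table)

DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, 3),
  stock_id = rep(c(1, 2, 3), each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 18)
)
DT <- DT[-c(4, 5)]                      # same gap as in the example above
setkeyv(DT, c("stock_id", "date"))      # rows must be sorted by date per stock

# Previous row's logret within each stock; first row of each stock is NA
DT[, lagret := shift(logret, n = 1L, type = "lag"), by = stock_id]

# Regress the return on its lag; lm() drops the NA rows by default
fit <- lm(logret ~ lagret, data = DT)
print(coef(fit))
```

Note the difference in meaning: shift() always uses the previous available row, however old it is, while the join with date-1 and roll=TRUE looks backwards from a specific date.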
