R data.table grouping for lagged regression

Problem description

I have a table of data (it's a data.table object) that looks like the following:

      date         stock_id logret
   1: 2011-01-01        1  0.001
   2: 2011-01-02        1  0.003
   3: 2011-01-03        1  0.005
   4: 2011-01-04        1  0.007
   5: 2011-01-05        1  0.009
   6: 2011-01-06        1  0.011
   7: 2011-01-01        2  0.013
   8: 2011-01-02        2  0.015
   9: 2011-01-03        2  0.017
  10: 2011-01-04        2  0.019
  11: 2011-01-05        2  0.021
  12: 2011-01-06        2  0.023
  13: 2011-01-01        3  0.025
  14: 2011-01-02        3  0.027
  15: 2011-01-03        3  0.029
  16: 2011-01-04        3  0.031
  17: 2011-01-05        3  0.033
  18: 2011-01-06        3  0.035

The above can be created as:

library(data.table)

DT = data.table(
   date=rep(as.Date('2011-01-01')+0:5,3),
   stock_id=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
   logret=seq(0.001, by=0.002, length.out=18))

setkeyv(DT,c('stock_id','date'))

Of course the real table is larger, with many more stock_ids and dates. The aim is to reshape this data table so that I can run a regression of each stock's log returns on its own log returns lagged by 1 day (or by the prior traded day in the case of weekends).

The final result would look like:

      date         stock_id logret lagret
   1: 2011-01-01        1  0.001    NA
   2: 2011-01-02        1  0.003    0.001
   3: 2011-01-03        1  0.005    0.003
   ....
  16: 2011-01-04        3  0.031  0.029
  17: 2011-01-05        3  0.033  0.031
  18: 2011-01-06        3  0.035  0.033

I'm finding this data structure really tricky to build without mixing up my stockid.

Solution

Just some additional notes due to Alex's comment. The reason you have difficulties understanding what's going on here is that a lot of things are done within one line. So it's always a good idea to break things down.

What do we actually want? We want a new column lagret and the syntax to add a new column in data.table is the following:

DT[, lagret := xxx]

where xxx is replaced by whatever you want to have in column lagret. So if we just wanted a new column holding the row numbers, we could call

DT[, lagret := seq(from=1, to=nrow(DT))]
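To see the := syntax in isolation, here is a minimal sketch on a throwaway table. The table toy and the column names row_id and row_id2 are made up for illustration; .I is data.table's built-in row-index symbol.

```r
library(data.table)

# A tiny hypothetical table, unrelated to the stock data
toy <- data.table(x = c("a", "b", "c"))

# := adds the column to toy by reference, so no reassignment is needed
toy[, row_id := seq_len(nrow(toy))]

# .I holds the row numbers when used in j without grouping
toy[, row_id2 := .I]
```

Both columns should hold the same row numbers; seq_len() is generally a bit safer than seq(from=1, to=n) because it handles an empty table gracefully.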

Here, we actually want the lagged value of logret, but we have to consider that there are many stocks in here. That's why we do a self-join, i.e. we join the data.table DT with itself by the columns stock_id and date, but since we want the previous value of each stock, we use date-1. Note that we have to set the keys first to do such a join:

setkeyv(DT,c('stock_id','date'))
DT[list(stock_id,date-1)]
    stock_id       date logret
 1:        1 2010-12-31     NA
 2:        1 2011-01-01  0.001
 3:        1 2011-01-02  0.003
 4:        1 2011-01-03  0.005
 5:        1 2011-01-04  0.007
 6:        1 2011-01-05  0.009
...

As you can see, we now have what we want: logret is lagged by one period. But we actually want that in a new column lagret in DT, so we extract it by calling [[3L]] (which means nothing more than "get me the third column") and name the new column lagret:

DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
          date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-04        1  0.007  0.005
 5: 2011-01-05        1  0.009  0.007
...
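A note for anyone running this on a current data.table release: as far as I can tell, the [[3L]] step depends on the older behaviour where the join returned the key columns together with logret. In recent versions the inner expression should already come back as a plain vector, so [[3L]] would pick out a single element rather than a column. The sketch below assumes a recent version. The lagret_shift column and the use of shift() are my additions for comparison, not part of the original answer.

```r
library(data.table)

DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, 3),
  stock_id = rep(1:3, each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 18)
)
setkeyv(DT, c("stock_id", "date"))

# Self-join on (stock_id, date - 1). Assumption: with a recent data.table,
# j = logret returns a plain vector with one value per looked-up row,
# and NA where the previous day does not exist for that stock.
DT[, lagret := DT[list(stock_id, date - 1), logret]]

# Alternative: lag by row position within each stock instead of by calendar day
DT[, lagret_shift := shift(logret, n = 1L, type = "lag"), by = stock_id]
```

On gap-free data like this the two columns should agree. They are not the same idea, though: the join looks up a specific calendar date, while shift() simply takes the previous row of the same stock.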

This is already the correct solution. In this simple case, we do not need roll=TRUE because there are no gaps in the dates. However, in a more realistic example (as mentioned above, for instance when we have weekends), there might be gaps. So let's make such a realistic example by just deleting two days in the DT for the first stock:

DT <- DT[-c(4, 5)]
setkeyv(DT,c('stock_id','date'))
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
          date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-06        1  0.011     NA
 5: 2011-01-01        2  0.013     NA
...

As you can see, the problem is now that the lagged value for the 6th of January is NA, because there is no row for the 5th of January to match against. That's why we use roll=TRUE:

DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
          date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-06        1  0.011  0.005
 5: 2011-01-01        2  0.013     NA
...

Have a look at the data.table documentation for exactly how roll=TRUE works. In a nutshell: if it can't find the previous value (here logret for the 5th of January), it takes the last available one (here from the 3rd of January).
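For completeness, the gap example can be put together as one self-contained sketch. It assumes a recent data.table version, where the inner join returns a plain vector and the [[3L]] step is therefore left out; everything else follows the steps above.

```r
library(data.table)

DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, 3),
  stock_id = rep(1:3, each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 18)
)

# Drop the 4th and 5th of January for stock 1 to create a gap
DT <- DT[-c(4, 5)]
setkeyv(DT, c("stock_id", "date"))

# Rolling self-join: when (stock_id, date - 1) has no exact match,
# roll = TRUE carries the last earlier observation of that stock forward.
# Dates before a stock's first observation still give NA.
DT[, lagret := DT[list(stock_id, date - 1), logret, roll = TRUE]]
```

One thing to keep in mind: roll=TRUE rolls forward without limit, so a stock that did not trade for a month would still pick up its last value. If that is not wanted, roll also accepts a positive number that caps how far a value is carried, for example roll=3 for at most three days.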
