R data.table grouping for lagged regression
Problem description
为1月5日),它只是最后一个可用的(这是从1月3日)。table with data (its a data.table object) that looks like the following :
              date stock_id logret
     1: 2011-01-01        1  0.001
     2: 2011-01-02        1  0.003
     3: 2011-01-03        1  0.005
     4: 2011-01-04        1  0.007
     5: 2011-01-05        1  0.009
     6: 2011-01-06        1  0.011
     7: 2011-01-01        2  0.013
     8: 2011-01-02        2  0.015
     9: 2011-01-03        2  0.017
    10: 2011-01-04        2  0.019
    11: 2011-01-05        2  0.021
    12: 2011-01-06        2  0.023
    13: 2011-01-01        3  0.025
    14: 2011-01-02        3  0.027
    15: 2011-01-03        3  0.029
    16: 2011-01-04        3  0.031
    17: 2011-01-05        3  0.033
    18: 2011-01-06        3  0.035

The above can be created as:

    DT = data.table(
      date = rep(as.Date('2011-01-01') + 0:5, 3),
      stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
      logret = seq(0.001, by = 0.002, length.out = 18))
    setkeyv(DT, c('stock_id','date'))
Of course the real table is larger, with many more stock_ids and dates. The aim is to reshape this data table so that I can run a regression of each stock's log returns on its own log returns lagged by 1 day (or by the prior traded day in the case of weekends).
The final result would look like:

              date stock_id logret lagret
     1: 2011-01-01        1  0.001     NA
     2: 2011-01-02        1  0.003  0.001
     3: 2011-01-03        1  0.005  0.003
    ....
    16: 2011-01-04        3  0.031  0.029
    17: 2011-01-05        3  0.033  0.031
    18: 2011-01-06        3  0.035  0.033
I'm finding this data structure really tricky to build without mixing up my stockid.
Solution
Just some additional notes due to Alex's comment. The reason you have difficulty understanding what's going on here is that a lot of things are done within one line. So it's always a good idea to break things down.
What do we actually want? We want a new column lagret, and the syntax to add a new column in data.table is the following:

    DT[, lagret := xxx]

where xxx has to be filled in with whatever you want to have in the column lagret. So if we just wanted a new column that gives us the row numbers, we could call:

    DT[, lagret := seq(from=1, to=nrow(DT))]
Here, we actually want the lagged value of logret, but we have to consider that there are many stocks in here. That's why we do a self-join, i.e. we join the data.table DT with itself by the columns stock_id and date, but since we want the previous value of each stock, we use date-1. Note that we have to set the keys first to do such a join:

    setkeyv(DT,c('stock_id','date'))
    DT[list(stock_id,date-1)]
       stock_id       date logret
    1:        1 2010-12-31     NA
    2:        1 2011-01-01  0.001
    3:        1 2011-01-02  0.003
    4:        1 2011-01-03  0.005
    5:        1 2011-01-04  0.007
    6:        1 2011-01-05  0.009
    ...
As you can see, we now have what we want: logret is lagged by one period. But we actually want that in a new column lagret in DT, so we just get that column by calling [[3L]] (this means nothing other than "get me the third column") and name this new column lagret:

    DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
             date stock_id logret lagret
    1: 2011-01-01        1  0.001     NA
    2: 2011-01-02        1  0.003  0.001
    3: 2011-01-03        1  0.005  0.003
    4: 2011-01-04        1  0.007  0.005
    5: 2011-01-05        1  0.009  0.007
    ...
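A note on versions: in recent data.table releases, a join with a single column name in j typically returns a plain vector rather than a table with the key columns, in which case the [[3L]] step would no longer pick out a column. The following is a minimal sketch of the same lookup under that assumption. The names lookup and lagged are hypothetical helpers introduced only for illustration, and the on= argument is used so the join columns are spelled out.

```r
library(data.table)

# Rebuild the example table from the question.
DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, 3),
  stock_id = rep(c(1, 2, 3), each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 18)
)
setkeyv(DT, c("stock_id", "date"))

# Hypothetical helper table: for every row, the same stock on the previous calendar day.
lookup <- DT[, list(stock_id, date = date - 1)]

# Look up logret at (stock_id, date - 1); rows without a match come back as NA.
lagged <- DT[lookup, logret, on = c("stock_id", "date")]

# Add the result as a new column by reference.
DT[, lagret := lagged]
print(head(DT, 3))
```

Splitting the lookup from the assignment keeps each step inspectable, which is the same "break things down" idea as above.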
This is already the correct solution. In this simple case, we do not need roll=TRUE because there are no gaps in the dates. However, in a more realistic example (as mentioned above, for instance when we have weekends), there might be gaps. So let's make such a realistic example by just deleting two days in DT for the first stock:

    DT <- DT[-c(4, 5)]
    setkeyv(DT,c('stock_id','date'))
    DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
             date stock_id logret lagret
    1: 2011-01-01        1  0.001     NA
    2: 2011-01-02        1  0.003  0.001
    3: 2011-01-03        1  0.005  0.003
    4: 2011-01-06        1  0.011     NA
    5: 2011-01-01        2  0.013     NA
    ...
As you can see, the problem is now that we don't have a lagged value for the 6th of January, because the 5th is missing. That's why we use roll=TRUE:

    DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
             date stock_id logret lagret
    1: 2011-01-01        1  0.001     NA
    2: 2011-01-02        1  0.003  0.001
    3: 2011-01-03        1  0.005  0.003
    4: 2011-01-06        1  0.011  0.005
    5: 2011-01-01        2  0.013     NA
    ...
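The rolling lookup can also be sketched with an explicit lookup table, again assuming a recent data.table where a single column in j comes back as a vector. The names lookup and lagged are hypothetical. roll = TRUE applies to the last join column (date here), and it only rolls within the same stock_id.

```r
library(data.table)

# Example table from the question, then drop 4 and 5 January for stock 1.
DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, 3),
  stock_id = rep(c(1, 2, 3), each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 18)
)
setkeyv(DT, c("stock_id", "date"))
DT <- DT[-c(4, 5)]

# Hypothetical helper table: same stock, previous calendar day.
lookup <- DT[, list(stock_id, date = date - 1)]

# With roll = TRUE, a missing (stock_id, date - 1) falls back to the most
# recent earlier date of the same stock; with nothing earlier it stays NA.
lagged <- DT[lookup, logret, on = c("stock_id", "date"), roll = TRUE]

DT[, lagret := lagged]
print(DT[stock_id == 1])
```

The first observation of every stock still gets NA, because there is no earlier value of that stock to roll forward.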
Just have a look at the documentation to see how roll=TRUE works exactly. In a nutshell: if it can't find the previous value (here logret for the 5th of January), it just takes the last available one (here from the 3rd of January).
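As an alternative that is not part of the answer above, newer data.table versions provide shift(), which takes the previous row within each group. This is only a sketch: it assumes the table is sorted by date within each stock and has one row per stock and day, in which case "previous row" should coincide with "prior traded day". The final lm() call illustrates the regression the question is aiming at, with fit as a hypothetical name.

```r
library(data.table)

DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, 3),
  stock_id = rep(c(1, 2, 3), each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 18)
)
setkeyv(DT, c("stock_id", "date"))  # sorts by stock, then by date

# Lag logret by one row within each stock; the first row of each stock is NA.
DT[, lagret := shift(logret, n = 1L, type = "lag"), by = stock_id]

# Pooled regression of logret on its lag; lm() drops the NA rows by default.
fit <- lm(logret ~ lagret, data = DT)
print(coef(fit))
```

Grouping with by = stock_id is what keeps one stock's returns from leaking into another stock's lag.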