R data.table grouping for lagged regression
Question
I have a table with data (it is a data.table object) that looks like the following:
date stock_id logret
1: 2011-01-01 1 0.001
2: 2011-01-02 1 0.003
3: 2011-01-03 1 0.005
4: 2011-01-04 1 0.007
5: 2011-01-05 1 0.009
6: 2011-01-06 1 0.011
7: 2011-01-01 2 0.013
8: 2011-01-02 2 0.015
9: 2011-01-03 2 0.017
10: 2011-01-04 2 0.019
11: 2011-01-05 2 0.021
12: 2011-01-06 2 0.023
13: 2011-01-01 3 0.025
14: 2011-01-02 3 0.027
15: 2011-01-03 3 0.029
16: 2011-01-04 3 0.031
17: 2011-01-05 3 0.033
18: 2011-01-06 3 0.035
The above can be created as:
library(data.table)
DT = data.table(
  date = rep(as.Date('2011-01-01') + 0:5, 3),
  stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
  logret = seq(0.001, by = 0.002, length.out = 18))
setkeyv(DT, c('stock_id', 'date'))
Of course the real table is larger, with many more stock_ids and dates. The aim is to reshape this data table so that I can run a regression of each stock's log returns on its own log returns lagged by one day (or by the prior traded day in the case of weekends).
The final result would look like this:
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
....
16: 2011-01-04 3 0.031 0.029
17: 2011-01-05 3 0.033 0.031
18: 2011-01-06 3 0.035 0.033
I'm finding this data structure really tricky to build without mixing up my stock_ids.
Recommended answer
Just some additional notes due to Alex's comment. The reason you have difficulty understanding what is going on here is that a lot of things are done in a single line, so it is always a good idea to break things down.
What do we actually want? We want a new column lagret, and the syntax to add a new column in data.table is the following:
DT[, lagret := xxx]
where xxx has to be replaced with whatever you want to have in column lagret. So if we just wanted a new column that gives us the row numbers, we could call
DT[, lagret := seq(from=1, to=nrow(DT))]
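As a minimal sketch of the := syntax on its own, the following uses a small hypothetical table named toy (not the DT from the question) and two illustrative column names, rownum and row_in_stock. It also shows that combining := with by= evaluates the right-hand side once per group, which is the property a per-stock lag relies on.

```r
library(data.table)

# Hypothetical toy table, smaller than the one in the question
toy <- data.table(stock_id = c(1, 1, 2, 2),
                  logret   = c(0.001, 0.003, 0.013, 0.015))

# := adds the column in place (by reference); no reassignment is needed
toy[, rownum := seq_len(nrow(toy))]

# With by=, the right-hand side is evaluated separately for each stock
toy[, row_in_stock := seq_len(.N), by = stock_id]

print(toy)
```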
Here, we actually want the lagged value of logret, but we have to consider that there are many stocks in the table. That is why we do a self-join, i.e. we join the data.table DT with itself on the columns stock_id and date, but since we want the previous value for each stock, we use date-1. Note that we have to set the keys first to do such a join:
setkeyv(DT,c('stock_id','date'))
DT[list(stock_id,date-1)]
stock_id date logret
1: 1 2010-12-31 NA
2: 1 2011-01-01 0.001
3: 1 2011-01-02 0.003
4: 1 2011-01-03 0.005
5: 1 2011-01-04 0.007
6: 1 2011-01-05 0.009
...
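The same self-join can be written with the lookup table spelled out, which may make it easier to see what is being matched. This is only a sketch: lookup and joined are illustrative names, and the on= argument assumes a reasonably recent data.table release. In such releases the columns of the result follow the column order of DT, so the printed layout can differ from the output above.

```r
library(data.table)

DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, 3),
  stock_id = rep(c(1, 2, 3), each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 18))
setkeyv(DT, c("stock_id", "date"))

# Hypothetical lookup table: one row per original row, date moved back a day
lookup <- DT[, list(stock_id, date = date - 1)]

# For each lookup row, fetch the row of DT with the same stock and date;
# rows without a match (the day before each stock's first date) get NA
joined <- DT[lookup, on = c("stock_id", "date")]

print(head(joined))
```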
As you can see, we now have what we want: logret is now lagged by one period. But we actually want that in a new column lagret in DT, so we extract that column by calling [[3L]] (this means nothing more than "give me the third column") and name the new column lagret:
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-04 1 0.007 0.005
5: 2011-01-05 1 0.009 0.007
...
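One caveat, to the best of my understanding: in recent data.table releases a join with a single column name in j returns a plain vector rather than a table that includes the key columns. In that case [[3L]] would pick a single element instead of a column, so the step should be dropped. A sketch under that assumption, with lookup and lagged as illustrative names:

```r
library(data.table)

DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, 3),
  stock_id = rep(c(1, 2, 3), each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 18))
setkeyv(DT, c("stock_id", "date"))

# Hypothetical lookup table: same stocks, dates moved back by one day
lookup <- DT[, list(stock_id, date = date - 1)]

# A single column name in j yields a vector with one entry per lookup row
lagged <- DT[lookup, logret, on = c("stock_id", "date")]

# Assign the vector as the new column, by reference
DT[, lagret := lagged]

print(head(DT))
```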
This is already the correct solution. In this simple case we do not need roll=TRUE because there are no gaps in the dates. However, in a more realistic example (as mentioned above, for instance when there are weekends), there may be gaps. So let's construct such an example by deleting two days for the first stock in DT:
DT <- DT[-c(4, 5)]
setkeyv(DT,c('stock_id','date'))
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-06 1 0.011 NA
5: 2011-01-01 2 0.013 NA
...
As you can see, the problem is that we now have no lagged value for the 6th of January. That is why we use roll=TRUE:
DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-06 1 0.011 0.005
5: 2011-01-01 2 0.013 NA
...
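For completeness, here is a self-contained sketch of the rolling version on the data with the two deleted days. It again assumes a recent data.table release (on= argument, vector result from a single column in j); lookup and lagged are illustrative names.

```r
library(data.table)

DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, 3),
  stock_id = rep(c(1, 2, 3), each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 18))

# Drop 2011-01-04 and 2011-01-05 for the first stock to create a gap
DT <- DT[-c(4, 5)]
setkeyv(DT, c("stock_id", "date"))

lookup <- DT[, list(stock_id, date = date - 1)]

# roll = TRUE applies to the last join column (date): when date - 1 is
# missing for a stock, the most recent earlier observation is used instead
lagged <- DT[lookup, logret, on = c("stock_id", "date"), roll = TRUE]
DT[, lagret := lagged]

print(head(DT))
```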
Have a look at the documentation to see exactly how roll=TRUE works. In a nutshell: if it cannot find the previous value (here logret for the 5th of January), it takes the last available one (here from the 3rd of January).
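Because the rolling join always falls back to the last available observation, its result here should coincide with lagging by row within each stock. The sketch below uses shift(), which is a different technique from the self-join in this answer and assumes a data.table release that provides it. It then runs the regression the question is aiming for with lm(); the name fit is illustrative.

```r
library(data.table)

DT <- data.table(
  date     = rep(as.Date("2011-01-01") + 0:5, 3),
  stock_id = rep(c(1, 2, 3), each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 18))
DT <- DT[-c(4, 5)]                      # same gap as in the example above
setkeyv(DT, c("stock_id", "date"))      # sorts by stock, then date

# Lag by one row within each stock; the first row of every stock gets NA
DT[, lagret := shift(logret), by = stock_id]

# Pooled regression of each return on its lag; lm() drops the NA rows
fit <- lm(logret ~ lagret, data = DT)
print(coef(fit))
```

Whether a pooled regression or one regression per stock is appropriate depends on the analysis; the pooled form is shown only because it is the shortest.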