Rolling regression over multiple columns

Question

I am having trouble finding the most efficient way to calculate a rolling linear regression over an xts object with multiple columns. I have searched and read several previous questions here on Stack Overflow.

This question and answer comes close, but not close enough in my opinion, because I want to run multiple regressions with the dependent variable unchanged across all of them. I have tried to reproduce an example with random data:

require(xts)
require(RcppArmadillo)  # Load libraries

data <- matrix(sample(1:10000, 1500 * 5), 1500, 5, byrow = TRUE)  # Random data (one draw per cell, no recycling)
data[1000:1500, 2] <- NA  # insert NAs to make it more similar to true data
data <- xts(data, order.by = as.Date(1:1500, origin = "2000-01-01"))

NR <- nrow(data)  # number of observations
NC <- ncol(data)  # number of factors
obs <- 30  # required number of observations for rolling regression analysis
info.names <- c("res", "coef")

info <- array(NA, dim = c(NR, length(info.names), NC))
colnames(info) <- info.names

The array is created in order to store multiple variables (residuals, coefficients, etc.) over time and per factor.

loop.begin.time <- Sys.time()

for (j in 2:NC) {
  cat(paste("Processing residuals for factor:", j), "\n")
  for (i in obs:NR) {
    regression.temp <- fastLm(data[i:(i-(obs-1)), j] ~ data[i:(i-(obs-1)), 1])
    residuals.temp <- regression.temp$residuals
    info[i, "res", j] <- round(residuals.temp[1] / sd(residuals.temp), 4)
    info[i, "coef", j] <- regression.temp$coefficients[2]
  } 
}

loop.end.time <- Sys.time()
print(loop.end.time - loop.begin.time)  # prints the loop runtime

As the loop shows, the idea is to run a rolling regression over 30 observations, each time with data[, 1] as the dependent variable (factor) against one of the other factors. I have to store the 30 residuals in a temporary object in order to standardize them, because fastLm does not calculate standardized residuals.

The loop is extremely slow and becomes cumbersome as the number of columns (factors) in the xts object grows: with around 100 to 1,000 columns it would take an eternity. I am hoping someone has more efficient code for running rolling regressions over a large data set.

Recommended answer

It should be pretty quick if you go down to the level of the math behind linear regression. If X is the independent variable and Y is the dependent variable, the coefficients are given by

Beta = inv(t(X) %*% X) %*% (t(X) %*% Y)
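As a quick sanity check of the formula above, the following base-R sketch compares the normal-equations solution with lm on made-up data. It assumes a single regressor and no intercept column in X, so the fit goes through the origin. The variable names and the simulated data are illustrative only.

```r
set.seed(1)
n <- 30
X <- matrix(rnorm(n), ncol = 1)   # one regressor, no intercept column (assumption)
Y <- 2 * X[, 1] + rnorm(n)        # made-up response

# Normal-equations solution, written as in the formula above
beta <- solve(t(X) %*% X) %*% (t(X) %*% Y)

# With a single column the matrix algebra collapses to a ratio of two sums
beta_ratio <- sum(X[, 1] * Y) / sum(X[, 1]^2)

# Reference value from lm with the intercept removed
beta_lm <- coef(lm(Y ~ X - 1))

print(c(normal_eq = as.numeric(beta), ratio = beta_ratio, lm = unname(beta_lm)))
```

The ratio-of-sums form is what makes a rolling version cheap: both sums can be updated per window instead of refitting a model each time.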

I'm a little confused about which variable you want to be the dependent one and which the independent one, but hopefully solving a similar problem below will help you as well.

In the example below I use 1,000 variables instead of the original 5 and do not introduce any NAs.

require(xts)

data <- matrix(sample(1:10000, 1500000, replace = TRUE), 1500, 1000, byrow = TRUE)  # Random data
data <- xts(data, order.by = as.Date(1:1500, origin = "2000-01-01"))

NR <- nrow(data)  # number of observations
NC <- ncol(data)  # number of factors
obs <- 30  # required number of observations for rolling regression analysis

Now we can calculate the coefficients using Joshua's TTR package.

library(TTR)

loop.begin.time <- Sys.time()

in.dep.var <- data[,1]
xx <- TTR::runSum(in.dep.var*in.dep.var, obs)
coeffs <- do.call(cbind, lapply(data, function(z) {
    xy <- TTR::runSum(z * in.dep.var, obs)
    xy/xx
}))

loop.end.time <- Sys.time()

print(loop.end.time - loop.begin.time)  # prints the loop runtime

Time difference of 3.934461 secs
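Note that the ratio of rolling sums above fits a line through the origin, whereas a formula such as y ~ x in lm (and, as far as I know, in the formula interface of fastLm) also estimates an intercept. The sketch below shows one way the same rolling-sum idea could be extended to include an intercept. It uses base R only; roll_sum is a hypothetical helper written for this illustration, not a TTR function, and the data are simulated.

```r
# Hypothetical helper: trailing sum over the last n values, NA until a full window exists.
# Assumes v contains no NAs (cumsum would propagate them to every later window).
roll_sum <- function(v, n) {
  cs <- cumsum(v)
  out <- rep(NA_real_, length(v))
  out[n:length(v)] <- cs[n:length(v)] - c(0, cs[seq_len(length(v) - n)])
  out
}

set.seed(42)
n_obs <- 200
win   <- 30
x <- rnorm(n_obs)                  # regressor (made-up data)
y <- 1 + 0.5 * x + rnorm(n_obs)    # response (made-up data)

sx  <- roll_sum(x, win)
sy  <- roll_sum(y, win)
sxx <- roll_sum(x * x, win)
sxy <- roll_sum(x * y, win)

# Closed-form simple regression with intercept, evaluated for every window at once
slope     <- (win * sxy - sx * sy) / (win * sxx - sx^2)
intercept <- (sy - slope * sx) / win

# Compare the last window with lm
idx <- (n_obs - win + 1):n_obs
ref <- coef(lm(y[idx] ~ x[idx]))
print(rbind(rolling = c(intercept[n_obs], slope[n_obs]), lm = unname(ref)))
```

For many columns, the x-only sums (sx, sxx) need to be computed just once, and only sy and sxy have to be recomputed per column.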

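# res.array[t, j, k]: residual of the observation k - 1 steps before t for column j,
# evaluated with the coefficient of the window ending at t
res.array <- array(NA_real_, dim = c(NR, NC, obs))
for (k in seq_len(obs)) {
  res.array[, , k] <- coredata(lag.xts(data, k - 1)) -
    coredata(coeffs) * as.numeric(lag.xts(in.dep.var, k - 1))
}
# standardize within each window; the result has dimensions c(obs, NR, NC)
res.sd <- apply(res.array, c(1, 2), function(z) z / sd(z))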

If I haven't made any errors in the indexing, res.sd should give you the standardized residuals. Please feel free to fix this solution to correct any bugs.
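If only one standardized residual per window is needed (the question's loop appears to want the residual of the most recent observation divided by the standard deviation of that window's residuals), the three-dimensional array can be avoided. The following is a minimal sketch under that assumption, for the no-intercept fit used in this answer. It relies on base R only; roll_sum is again a hypothetical helper, and the data are simulated.

```r
# Hypothetical helper: trailing sum over the last n values, NA until a full window exists
roll_sum <- function(v, n) {
  cs <- cumsum(v)
  out <- rep(NA_real_, length(v))
  out[n:length(v)] <- cs[n:length(v)] - c(0, cs[seq_len(length(v) - n)])
  out
}

set.seed(7)
n_obs <- 120
win   <- 30
x <- rnorm(n_obs)               # regressor (made-up data)
y <- 0.8 * x + rnorm(n_obs)     # response (made-up data)

sx  <- roll_sum(x, win)
sy  <- roll_sum(y, win)
sxx <- roll_sum(x^2, win)
sxy <- roll_sum(x * y, win)
syy <- roll_sum(y^2, win)

b      <- sxy / sxx                             # slope through the origin, window ending at t
e_last <- y - b * x                             # residual of the newest observation in each window
sum_e  <- sy - b * sx                           # sum of the window's residuals
sum_e2 <- syy - 2 * b * sxy + b^2 * sxx         # sum of the window's squared residuals
sd_e   <- sqrt((sum_e2 - sum_e^2 / win) / (win - 1))  # same definition as sd()
z_last <- e_last / sd_e

# Brute-force check on the final window
idx   <- (n_obs - win + 1):n_obs
e     <- y[idx] - b[n_obs] * x[idx]
z_ref <- e[win] / sd(e)
print(c(rolling = z_last[n_obs], direct = z_ref))
```

The sums of products subtract nearly equal quantities when residuals are tiny relative to the data, so for badly scaled inputs it may be worth centering or rescaling the columns first.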
