How to compute minimal but fast linear regressions on each column of a response matrix?


Question


I want to compute ordinary least squares (OLS) estimates in R without using "lm", for several reasons. First, "lm" also computes lots of things I don't need (such as the fitted values), given that data size is an issue in my case. Second, I want to be able to implement OLS myself in R before doing it in another language (e.g. in C with the GSL).

As you may know, the model is: Y = Xb + E, with E ~ N(0, sigma^2). As detailed below, b is a vector with 2 parameters, the mean (b0) and another coefficient (b1). In the end, for each linear regression I do, I want the estimate of b1 (effect size), its standard error, the estimate of sigma^2 (residual variance), and R^2 (coefficient of determination).

Here are my data. I have N samples (e.g. individuals, N ~= 100). For each sample, I have Y outputs (response variables, Y ~= 10^3) and X points (explanatory variables, X ~= 10^6). I want to treat the Y outputs separately, i.e. I want to launch Y linear regressions: one for output 1, one for output 2, etc. Moreover, I want to use the explanatory variables one by one: for output 1, I want to regress it on point 1, then on point 2, then ... finally on point X. (I hope that's clear...!)

Here is my R code to check the speed of "lm" versus computing OLS estimates myself via matrix algebra.

First, I simulate dummy data:

nb.samples <-  10  # N
nb.points <- 1000  # X
x <- matrix(data=replicate(nb.samples,sample(x=0:2,size=nb.points, replace=T)),
            nrow=nb.points, ncol=nb.samples, byrow=F,
            dimnames=list(points=paste("p",seq(1,nb.points),sep=""),
              samples=paste("s",seq(1,nb.samples),sep="")))
nb.outputs <- 10  # Y
y <- matrix(data=replicate(nb.outputs,rnorm(nb.samples)),
            nrow=nb.samples, ncol=nb.outputs, byrow=T,
            dimnames=list(samples=paste("s",seq(1,nb.samples),sep=""),
              outputs=paste("out",seq(1,nb.outputs),sep="")))

Here is my own function used just below:

GetResFromCustomLinReg <- function(Y, xi){ # both Y and xi are N-dim vectors
  n <- length(Y)
  X <- cbind(rep(1,n), xi)  # design matrix: intercept column plus xi
  p <- 1      # nb of explanatory variables, besides the mean
  r <- p + 1  # rank of X: nb of independent explanatory variables
  inv.XtX <- solve(t(X) %*% X)
  beta.hat <- inv.XtX %*% t(X) %*% Y
  Y.hat <- X %*% beta.hat
  E.hat <- Y - Y.hat
  E2.hat <- (t(E.hat) %*% E.hat)
  sigma2.hat <- (E2.hat / (n - r))[1,1]
  var.covar.beta.hat <- sigma2.hat * inv.XtX
  se.beta.hat <- sqrt(diag(var.covar.beta.hat))  # standard errors of b0 and b1
  Y.bar <- mean(Y)
  R2 <- 1 - (E2.hat) / (t(Y-Y.bar) %*% (Y-Y.bar))
  return(c(beta.hat[2], se.beta.hat[2], sigma2.hat, R2))
}
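As an aside, here is a sketch of my own (an assumption about where the time goes, not something verified yet): since (X'X)^-1 depends only on xi, all Y regressions for a given xi can share it, replacing the inner "apply" over the columns of y with a few matrix operations:

```r
# Sketch: same estimates as GetResFromCustomLinReg, but for all columns of
# the response matrix at once, computing (X'X)^-1 only once per xi.
GetResAllOutputs <- function(Y, xi) {  # Y: n x m response matrix, xi: n-vector
  n <- nrow(Y)
  X <- cbind(1, xi)                    # design matrix with intercept
  r <- ncol(X)                         # rank of X (intercept + slope)
  inv.XtX <- solve(crossprod(X))       # (X'X)^-1, shared by all outputs
  B <- inv.XtX %*% crossprod(X, Y)     # 2 x m matrix of coefficients
  E <- Y - X %*% B                     # residuals for all outputs at once
  rss <- colSums(E^2)                  # residual sum of squares per output
  sigma2.hat <- rss / (n - r)          # residual variance per output
  se.b1 <- sqrt(sigma2.hat * inv.XtX[2, 2])   # std error of each slope
  tss <- colSums(sweep(Y, 2, colMeans(Y))^2)  # total sum of squares
  rbind(b1 = B[2, ], se.b1 = se.b1, sigma2 = sigma2.hat, R2 = 1 - rss / tss)
}
```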

Here is my code using the built-in "lm":

res.bi.all <- apply(x, 1, function(xi){lm(y ~ xi)})

Here is my custom OLS code:

res.cm.all <- apply(x, 1, function(xi){apply(y, 2, GetResFromCustomLinReg, xi)})

When I run this example with the values given above, I get:

> system.time( res.bi.all <- apply(x, 1, function(xi){lm(y ~ xi)}) )
   user  system elapsed
  2.526   0.000   2.528
> system.time( res.cm.all <- apply(x, 1, function(xi){apply(y, 2, GetResFromCustomLinReg, xi)}) )
   user  system elapsed
  4.561   0.000   4.561

(And, naturally, it gets worse when increasing N, X and Y.)

Of course, "lm" has the nice property of "automatically" fitting each column of the response matrix separately (y ~ xi), while I have to use "apply" Y times (once per yi ~ xi). But is that the only reason my code is slower? Does anyone know how to improve it?
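One base-R middle ground I have been considering (my own idea, so treat it as a sketch): "lm.fit" skips the formula handling that seems to dominate "lm"'s cost here, and it accepts a matrix response, so one call fits all Y regressions for a given xi:

```r
# Sketch: lm.fit() takes a pre-built model matrix and a (possibly matrix)
# response, returning coefficients and residuals without lm()'s overhead.
xi1  <- rnorm(10)                  # one explanatory variable (dummy data)
ymat <- matrix(rnorm(10 * 3), 10)  # n x Y response matrix (dummy data)
X1   <- cbind(1, xi1)              # model matrix with intercept
fit1 <- lm.fit(X1, ymat)
b1.all <- fit1$coefficients[2, ]   # slope estimates for every output
```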

(Sorry for such a long question, but I really tried to provide a minimal, yet comprehensive example.)

> sessionInfo()
R version 2.12.2 (2011-02-25)
Platform: x86_64-redhat-linux-gnu (64-bit)

Solution

Have a look at the fastLm() function in the RcppArmadillo package on CRAN. There is also a similar fastLm() in RcppGSL which preceded this -- but you probably want the Armadillo-based solution. I have some slides in older presentations (on HPC with R) that show the speed gains.
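A minimal usage sketch, assuming RcppArmadillo is installed from CRAN (field names below are from its "fastLmPure" interface, which takes a model matrix and a response vector directly):

```r
# Hedged sketch: fastLmPure() returns the coefficients, their standard
# errors and the residual degrees of freedom, with no formula overhead.
if (requireNamespace("RcppArmadillo", quietly = TRUE)) {
  xif <- rnorm(10)                  # dummy explanatory variable
  y1f <- rnorm(10)                  # one response column (dummy data)
  Xf  <- cbind(1, xif)              # model matrix with intercept
  ffit <- RcppArmadillo::fastLmPure(Xf, y1f)
  ffit$coefficients[2]              # slope estimate b1
  ffit$stderr[2]                    # its standard error
}
```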

Also note the hint in the help page about 'pivoted' approaches, which behave better than taking the straight inverse of X'X and can matter with degenerate model matrices.
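To illustrate the pivoted idea (my sketch, not from the help page itself): base R's qr() yields the same coefficients without ever forming the inverse of X'X, and a QR factorization is also what lm() uses internally:

```r
# Sketch: fit via QR decomposition instead of solve(t(X) %*% X).
xi2 <- rnorm(10)            # dummy explanatory variable
y2  <- rnorm(10)            # one response column (dummy data)
X2  <- cbind(1, xi2)        # model matrix with intercept
qrX <- qr(X2)               # (pivoted) QR factorization of X
beta <- qr.coef(qrX, y2)    # coefficients, numerically more stable
```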
