一次拟合多个公式,比lapply更快的选择? [英] Fit many formulae at once, faster options than lapply?

查看:88
本文介绍了一次拟合多个公式,比lapply更快的选择?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个要适合数据的公式的列表,而不是运行一个循环,出于性能考虑,我想立即执行此操作.估计应该仍然是分开的,我不是要估计SUR或其他任何值. 以下代码可以实现我想要的

I have a list for formulas I want to fit to data, rather than running a loop I'd like to do this at once, for performance's sake. The estimations should still be separate, I'm not trying to estimate a SUR or anything. The following code does what I want

x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
formulae <-list(y~x[,1],
                y~x[,2],
                y~x[,1] + x[,2])
lapply(formulae,lm)

不幸的是,随着formulae长度的增加,它变得有些慢了,有没有办法真正将其向量化?

Unfortunately this gets somewhat slow as the length of formulae increases is there a way to truly vectorize this?

如果有帮助,我关心的lm的唯一结果就是系数和一些标准误差.

If it is any help, the only results of lm I care about are coefficients, and some standard errors.

推荐答案

正如我在评论中所说,您真正需要的是除lm()之外更有效,更稳定的拟合例程.在这里,我将为您提供一个经过充分测试的自己写的,称为lm.chol().它需要一个formuladata,并返回:

As I said in my comment, what you really need is a more efficient yet stable fitting routine other than lm(). Here I would provide you a well tested one written myself, called lm.chol(). It takes a formula and data, and returns:

  • 系数汇总表,如您通常在summary(lm(...))$coef中看到的;
  • summary(lm(...))$sigma获得的残留标准误差的皮尔森估计;
  • summary(lm(...))$adj.r.squared获得的
  • 调整后的R.squared.
  • a coefficient summary table, as you normally see in summary(lm(...))$coef;
  • Pearson estimate of residual standard error, as you get from summary(lm(...))$sigma;
  • adjusted-R.squared, as you get from summary(lm(...))$adj.r.squared.
## linear model estimation based on pivoted Cholesky factorization with Jacobi preconditioner
lm.chol <- function(formula, data) {
  ## stage0: get response vector and model matrix
  ## we did not follow the normal route: match.call, model.frame, model.response, model matrix, etc
  y <- data[[as.character(formula[[2]])]]
  X <- model.matrix(formula, data)
  n <- nrow(X); p <- ncol(X)
  ## stage 1: XtX and Jacobi diagonal preconditioner
  XtX <- crossprod(X)
  D <- 1 / sqrt(diag(XtX))
  ## stage 2: pivoted Cholesky factorization
  R <- suppressWarnings(chol(t(D * t(D * XtX)), pivot = TRUE))
  piv <- attr(R, "pivot")
  r <- attr(R, "rank")
  if (r < p) {
    warning("Model is rank-deficient!")
    piv <- piv[1:r]
    R <- R[1:r, 1:r]
    }
  ## stage 3: solve linear system for coefficients
  D <- D[piv]
  b <- D * crossprod(X, y)[piv]
  z <- forwardsolve(t(R), b)
  RSS <- sum(y * y) - sum(z * z)
  sigma <- sqrt(RSS / (n - r))
  para <- D * backsolve(R, z)
  beta.hat <- rep(NA, p)
  beta.hat[piv] <- para
  ## stage 4: get standard error
  Rinv <- backsolve(R, diag(r))
  se <- rep(NA, p)
  se[piv] <- D * sqrt(rowSums(Rinv * Rinv)) * sigma
  ## stage 5: t-statistic and p-value
  t.statistic <- beta.hat / se
  p.value <- 2 * pt(-abs(t.statistic), df = n - r)
  ## stage 6: construct coefficient summary matrix
  coefficients <- matrix(c(beta.hat, se, t.statistic, p.value), ncol = 4L)
  colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
  rownames(coefficients) <- colnames(X)
  ## stage 7: compute adjusted R.squared
  adj.R2 <- 1 - sigma * sigma / var(y)
  ## return model fitting results
  attr(coefficients, "sigma") <- sigma
  attr(coefficients, "adj.R2") <- adj.R2
  coefficients
  }

在这里,我将提供三个示例.

Here I would offer three examples.

示例1:完整线性模型

我们以R的内置数据集trees为例.

We take R's built-in dataset trees as an example.

# using `lm()`
summary(lm(Height ~ Girth + Volume, trees))
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept)  83.2958     9.0866   9.167 6.33e-10 ***
#Girth        -1.8615     1.1567  -1.609   0.1188    
#Volume        0.5756     0.2208   2.607   0.0145 *  
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#Residual standard error: 5.056 on 28 degrees of freedom
#Multiple R-squared:  0.4123,   Adjusted R-squared:  0.3703 
#F-statistic:  9.82 on 2 and 28 DF,  p-value: 0.0005868

## using `lm.chol()`
lm.chol(Height ~ Girth + Volume, trees)
#              Estimate Std. Error   t value     Pr(>|t|)
#(Intercept) 83.2957705  9.0865753  9.166905 6.333488e-10
#Girth       -1.8615109  1.1566879 -1.609346 1.187591e-01
#Volume       0.5755946  0.2208225  2.606594 1.449097e-02
#attr(,"sigma")
#[1] 5.056318
#attr(,"adj.R2")
#[1] 0.3702869

结果完全一样!

示例2:秩不足的线性模型

## toy data
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100,3,5))
dat$x3 <- with(dat, (x1 + x2) / 2)

## using `lm()`
summary(lm(y ~ x1 + x2 + x3, dat))
#Coefficients: (1 not defined because of singularities)
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)   0.2164     0.2530   0.856    0.394
#x1           -0.1526     0.3252  -0.469    0.640
#x2           -0.3534     0.5707  -0.619    0.537
#x3                NA         NA      NA       NA

#Residual standard error: 0.8886 on 97 degrees of freedom
#Multiple R-squared:  0.0069,   Adjusted R-squared:  -0.01358 
#F-statistic: 0.337 on 2 and 97 DF,  p-value: 0.7147

## using `lm.chol()`
lm.chol(y ~ x1 + x2 + x3, dat)
#              Estimate Std. Error    t value  Pr(>|t|)
#(Intercept)  0.2164455  0.2529576  0.8556595 0.3942949
#x1                  NA         NA         NA        NA
#x2          -0.2007894  0.6866871 -0.2924030 0.7706030
#x3          -0.3051760  0.6504256 -0.4691944 0.6399836
#attr(,"sigma")
#[1] 0.8886214
#attr(,"adj.R2")
#[1] -0.01357594
#Warning message:
#In lm.chol(y ~ x1 + x2 + x3, dat) : Model is rank-deficient!

在这里,基于具有完全旋转的Cholesky分解的lm.chol()和基于具有部分旋转的QR分解的lm()将不同的系数缩小为NA.但是两个估计是等效的,具有相同的拟合值和残差.

Here, lm.chol() based on Cholesky factorization with complete pivoting and lm() based on QR factorization with partial pivoting have shrunk different coefficients to NA. But two estimation are equivalent, with the same fitted values and residuals.

示例3:大型线性模型的性能

n <- 10000; p <- 300
set.seed(0)
dat <- as.data.frame(setNames(replicate(p, rnorm(n), simplify = FALSE), paste0("x",1:p)))
dat$y <- rnorm(n)

## using `lm()`
system.time(lm(y ~ ., dat))
#   user  system elapsed 
#  3.212   0.096   3.315

## using `lm.chol()`
system.time(lm.chol(y ~ ., dat))
#   user  system elapsed 
#  1.024   0.028   1.056

lm.chol()lm()快3到4倍.如果您想知道原因,请阅读我的答案.

lm.chol() is 3 ~ 4 times faster than lm(). If you want to know the reason, read my this answer.

备注

我专注于提高计算内核的性能.通过使用Ben Bolker的并行性建议,您可以更进一步.如果我的方法在4个内核上提供了3倍的提升,而并行计算在4个内核上提供了3倍的提升,那么您最终将获得9倍的提升!

I have focused on improving performance on computational kernel. You can take one step further, by using Ben Bolker's parallelism suggestion. If my approach gives 3 times boost, and parallel computing gives 3 times boost on 4 cores, you end up with 9 times boost!

这篇关于一次拟合多个公式,比lapply更快的选择?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆