Ridge regression with `glmnet` gives different coefficients than what I compute by "textbook definition"?

Question

I am running Ridge regression with the use of the glmnet R package. I noticed that the coefficients I obtain from the glmnet::glmnet function are different from those I get when computing the coefficients by definition (using the same lambda value). Could somebody explain to me why?

Data (both the response Y and the design matrix X) are scaled.

library(MASS)
library(glmnet)

# Data dimensions
p.tmp <- 100
n.tmp <- 100

# Data objects
set.seed(1)
X <- scale(mvrnorm(n.tmp, mu = rep(0, p.tmp), Sigma = diag(p.tmp)))
beta <- rep(0, p.tmp)
beta[sample(1:p.tmp, 10, replace = FALSE)] <- 10
Y.true <- X %*% beta
Y <- scale(Y.true + matrix(rnorm(n.tmp))) # Y.true + Gaussian noise

# Run glmnet 
ridge.fit.cv <- cv.glmnet(X, Y, alpha = 0)
ridge.fit.lambda <- ridge.fit.cv$lambda.1se

# Extract coefficient values for lambda.1se (without intercept)
ridge.coef <- (coef(ridge.fit.cv, s = ridge.fit.lambda))[2:(p.tmp+1)]

# Get coefficients "by definition"
ridge.coef.DEF <- solve(t(X) %*% X + ridge.fit.lambda * diag(p.tmp)) %*% t(X) %*% Y

# Plot estimates
plot(ridge.coef, type = "l", ylim = range(c(ridge.coef, ridge.coef.DEF)),
     main = "black: Ridge `glmnet`\nred: Ridge by definition")
lines(ridge.coef.DEF, col = "red")

Answer

If you read ?glmnet, you will see that the penalized objective function for a Gaussian response is:

1/2 * RSS / nobs + lambda * penalty

When the ridge penalty 1/2 * ||beta||_2^2 is used, we have

1/2 * RSS / nobs + 1/2 * lambda * ||beta||_2^2

which, after multiplying through by 2 * nobs, is proportional to

RSS + lambda * nobs * ||beta||_2^2

This is different from what we usually see in textbooks for ridge regression:

RSS + lambda * ||beta||_2^2
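
To make the difference concrete, here is a minimal sketch of the two objectives as plain R functions (my own illustration, not glmnet internals; it ignores the intercept and glmnet's internal standardization):

ridge_objective_glmnet <- function(beta, X, Y, lambda) {
  # 1/2 * RSS / nobs + 1/2 * lambda * ||beta||_2^2
  rss <- sum((Y - X %*% beta)^2)
  0.5 * rss / nrow(X) + 0.5 * lambda * sum(beta^2)
}

ridge_objective_textbook <- function(beta, X, Y, lambda) {
  # RSS + lambda * ||beta||_2^2
  sum((Y - X %*% beta)^2) + lambda * sum(beta^2)
}

Minimizing the first with a given lambda is therefore the same as minimizing the second with nobs * lambda.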

The formula you wrote:

##solve(t(X) %*% X + ridge.fit.lambda * diag(p.tmp)) %*% t(X) %*% Y
drop(solve(crossprod(X) + diag(ridge.fit.lambda, p.tmp), crossprod(X, Y)))

is for the textbook result; for glmnet we should expect:

##solve(t(X) %*% X + n.tmp * ridge.fit.lambda * diag(p.tmp)) %*% t(X) %*% Y
drop(solve(crossprod(X) + diag(n.tmp * ridge.fit.lambda, p.tmp), crossprod(X, Y)))

So, the textbook uses penalized least squares, but glmnet uses penalized mean squared error.

Note that I did not use your original code with t(), "%*%" and solve(A) %*% b; using crossprod and solve(A, b) is more efficient! See the Follow-up section at the end.

Now let's make a new comparison:

library(MASS)
library(glmnet)

# Data dimensions
p.tmp <- 100
n.tmp <- 100

# Data objects
set.seed(1)
X <- scale(mvrnorm(n.tmp, mu = rep(0, p.tmp), Sigma = diag(p.tmp)))
beta <- rep(0, p.tmp)
beta[sample(1:p.tmp, 10, replace = FALSE)] <- 10
Y.true <- X %*% beta
Y <- scale(Y.true + matrix(rnorm(n.tmp)))

# Run glmnet 
ridge.fit.cv <- cv.glmnet(X, Y, alpha = 0, intercept = FALSE)
ridge.fit.lambda <- ridge.fit.cv$lambda.1se

# Extract coefficient values for lambda.1se (without intercept)
ridge.coef <- (coef(ridge.fit.cv, s = ridge.fit.lambda))[-1]

# Get coefficients "by definition"
ridge.coef.DEF <- drop(solve(crossprod(X) + diag(n.tmp * ridge.fit.lambda, p.tmp), crossprod(X, Y)))

# Plot estimates
plot(ridge.coef, type = "l", ylim = range(c(ridge.coef, ridge.coef.DEF)),
     main = "black: Ridge `glmnet`\nred: Ridge by definition")
lines(ridge.coef.DEF, col = "red")
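
As an optional sanity check (my addition; it assumes the objects from the code above), the two estimates should now be close but not bitwise identical, since glmnet fits by coordinate descent up to a convergence threshold rather than solving the linear system exactly:

# Should be small, though not exactly zero
max(abs(ridge.coef - ridge.coef.DEF))
cor(ridge.coef, ridge.coef.DEF)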

Note that I have set intercept = FALSE when calling cv.glmnet (or glmnet). This is more of a conceptual point than something that matters much in practice. Conceptually, our textbook computation has no intercept, so we want to drop the intercept when using glmnet. But practically, since your X and Y are standardized, the theoretical estimate of the intercept is 0. Even with intercept = TRUE (the glmnet default), you can check that the estimate of the intercept is ~1e-17 (numerically 0), hence the estimates of the other coefficients are not noticeably affected. The other answer simply demonstrates this.
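
If you want to verify the intercept claim yourself, here is a quick sketch (my addition; note that cv.glmnet uses random folds, so lambda.1se will vary slightly between runs):

# Refit with the default intercept = TRUE and inspect the intercept estimate
fit.int <- cv.glmnet(X, Y, alpha = 0)
coef(fit.int, s = fit.int$lambda.1se)[1]  # ~1e-17, numerically zero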

Follow-up

As for using crossprod and solve(A, b) - interesting! Do you by chance have any reference to a simulation comparison for that?

t(X) %*% Y will first take the transpose X1 <- t(X), then do X1 %*% Y, while crossprod(X, Y) will not do the transpose. "%*%" is a wrapper for DGEMM for the case op(A) = A, op(B) = B, while crossprod is a wrapper for the case op(A) = A', op(B) = B. Similarly, tcrossprod is a wrapper for op(A) = A, op(B) = B'.
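
Here is a minimal timing sketch of this point (my own example; absolute timings depend on your machine and BLAS):

set.seed(1)
A <- matrix(rnorm(2000 * 500), 2000, 500)
y <- rnorm(2000)
system.time(for (i in 1:200) t(A) %*% y)       # explicit transpose first, then DGEMM
system.time(for (i in 1:200) crossprod(A, y))  # DGEMM with op(A) = A'; no transposed copy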

A major use of crossprod(X) is for t(X) %*% X; similarly, tcrossprod(X) is for X %*% t(X), in which case DSYRK instead of DGEMM is called. You can read the first section of Why the built-in lm function is so slow in R? for the reason and a benchmark.
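
A corresponding sketch for the symmetric case (again mine, not from the linked post):

set.seed(1)
X.big <- matrix(rnorm(2000 * 1000), 2000, 1000)
system.time(t(X.big) %*% X.big)  # generic DGEMM
system.time(crossprod(X.big))    # DSYRK, which exploits the symmetry of X'X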

Be aware that if X is not a square matrix, crossprod(X) and tcrossprod(X) are not equally fast, as they involve different amounts of floating point operations; for this you may read the side notice of Any faster R function than "tcrossprod" for symmetric dense matrix multiplication?

Regarding solve(A, b) and solve(A) %*% b, please read the first section of How to compute diag(X %*% solve(A) %*% t(X)) efficiently without taking matrix inverse? (https://stackoverflow.com/a/39534036/4891738)
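
To close, a small sketch (my addition) of why solve(A, b) is preferable: it factorizes A once and solves one linear system, whereas solve(A) %*% b forms the full inverse first and then multiplies, which is slower and numerically less accurate:

set.seed(1)
A <- crossprod(matrix(rnorm(1000 * 1000), 1000)) + diag(1000)  # symmetric positive definite
b <- rnorm(1000)
system.time(x1 <- solve(A, b))     # factorize A and solve one system
system.time(x2 <- solve(A) %*% b)  # compute the full inverse, then multiply
max(abs(x1 - x2))                  # the results agree up to rounding error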
