Is cv.glmnet overfitting the data by using the full lambda sequence?
Question
cv.glmnet has been used by most research papers and companies. While building a similar function to cv.glmnet for glmnet.cr (a similar package that implements the lasso for continuation-ratio ordinal regression), I came across this problem in cv.glmnet.
cv.glmnet first fits the model on the full data:

glmnet.object = glmnet(x, y, weights = weights, offset = offset,
    lambda = lambda, ...)
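When lambda is not supplied, glmnet derives the default sequence from the data itself. A base-R sketch of that construction under my own simplifying assumptions (standardized predictors, centered response, no intercept; the variable names are mine, not glmnet's):

```r
# Sketch of glmnet's default lambda sequence: start at the smallest
# lambda that forces every coefficient to zero, then decrease on a
# log scale.
set.seed(1)
n <- 100; p <- 5
x <- scale(matrix(rnorm(n * p), n, p))  # standardized predictors
yc <- rnorm(n); yc <- yc - mean(yc)     # centered response

alpha <- 1  # lasso penalty
# lambda_max: smallest lambda with an all-zero solution
lambda_max <- max(abs(crossprod(x, yc))) / (n * alpha)

# 100 log-spaced values down to a small fraction of lambda_max
lambda_seq <- exp(seq(log(lambda_max), log(lambda_max * 1e-4),
                      length.out = 100))
```

The key point for the question below is that this sequence is a tuning grid computed once from the full data, which is then reused in every fold.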
After the glmnet object is created with the complete data, the next step goes as follows. The lambda sequence from the full-data fit is extracted:
lambda = glmnet.object$lambda
Now they make sure the number of folds is at least 3:
if (nfolds < 3)
stop("nfolds must be bigger than 3; nfolds=10 recommended")
A list is created to store the cross-validated results:
outlist = as.list(seq(nfolds))
A for loop runs to fit the different data partitions, per the theory of cross-validation:
for (i in seq(nfolds)) {
    which = foldid == i
    if (is.matrix(y))
        y_sub = y[!which, ]
    else y_sub = y[!which]
    if (is.offset)
        offset_sub = as.matrix(offset)[!which, ]
    else offset_sub = NULL
    # using the lambdas computed from the complete data
    outlist[[i]] = glmnet(x[!which, , drop = FALSE],
        y_sub, lambda = lambda, offset = offset_sub,
        weights = weights[!which], ...)
}
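The scheme above can be sketched in self-contained base R. To keep the example runnable without the glmnet package, I use ridge regression (which has a closed-form solution) as a stand-in for the lasso; the structure that matters is the same: one lambda grid fixed up front, each fold fit over that whole grid, and the held-out error averaged per lambda. All names here are mine.

```r
set.seed(2)
n <- 60; p <- 4; nfolds <- 5
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + rnorm(n)

# Closed-form ridge coefficients for one lambda (no intercept, for brevity)
ridge_fit <- function(x, y, lambda) {
  solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, y))
}

# Shared lambda grid, fixed once (analogous to glmnet.object$lambda)
lambda <- exp(seq(log(10), log(0.01), length.out = 20))
foldid <- sample(rep(seq_len(nfolds), length.out = n))

# Mean held-out MSE per lambda, averaged over folds
cv_err <- sapply(lambda, function(l) {
  mean(sapply(seq_len(nfolds), function(i) {
    held_out <- foldid == i
    b <- ridge_fit(x[!held_out, , drop = FALSE], y[!held_out], l)
    mean((y[held_out] - x[held_out, , drop = FALSE] %*% b)^2)
  }))
})

lambda_min <- lambda[which.min(cv_err)]  # analogue of lambda.min
```

Note that each fold's coefficients are estimated only from that fold's training rows; the full data fixes nothing but the grid of candidate lambdas.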
So what happens? After fitting the model to the complete data, cross-validation is done with the lambdas from the complete data. Can someone tell me how this can possibly not be over-fitting the data? In cross-validation we want the model to have no information about the left-out part of the data. But cv.glmnet cheats on this!
Answer
No, it isn't. cv.glmnet() does build the entire solution path for the lambda sequence, but you never pick the last entry in that path. You typically pick lambda == lambda.1se (or lambda.min), as @Fabians said:

lambda == lambda.min : the lambda value at which the mean cross-validated error (cvm) is minimized
lambda == lambda.1se : the largest lambda whose cvm is within one standard error (cvsd) of that minimum; this more conservative choice is usually taken as the optimal lambda
See the documentation for cv.glmnet() and coef(..., s = 'lambda.1se').
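A short usage sketch of that workflow (assumes the glmnet package is installed; the data here is synthetic, for illustration only):

```r
library(glmnet)

set.seed(3)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- x[, 1] - x[, 2] + rnorm(100)

cvfit <- cv.glmnet(x, y, nfolds = 10)

cvfit$lambda.min  # lambda minimizing the cross-validated error
cvfit$lambda.1se  # largest lambda within one SE of that minimum

# Coefficients and predictions at the chosen lambda --
# never at the end of the full path:
coef(cvfit, s = "lambda.1se")
predict(cvfit, newx = x[1:5, ], s = "lambda.1se")
```

Selecting lambda.1se (or lambda.min) from the cross-validated error curve is exactly the safeguard the answer describes: the full-data path supplies only the candidate grid, while the choice among candidates is driven by held-out error.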