Building a nested logistic regression model using caret, glmnet and a (nested) cross-validation
My problem
I would like to build a logistic regression model with a high AUC in predicting a binary variable.
I would like to use the following approach (if feasible):
Use an elastic net model (glmnet) to reduce the predictors and find the best hyperparameters (alpha and lambda).
Combine the output of this model (a simple linear combination) with an additional predictor (the opinion of a super doctor, superdoc) in a logistic regression model (= finalmodel), similar to the approach described on page 26 in:
Afshar P, Mohammadi A, Plataniotis KN, Oikonomou A, Benali H. From Handcrafted to Deep-Learning-Based Cancer Radiomics: Challenges and Opportunities. IEEE Signal Process Mag 2019; 36: 132–60. Available here
Example data
As example data I have a dataset with many numeric predictors and a binary (pos/neg) outcome (diabetes).
# load libraries
library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)
library(broom)  # tidy(), used below on the coefficient matrix
library(pROC)   # roc(), used below to compute the AUC
# get example data
data(PimaIndiansDiabetes, package = "mlbench")
data <- PimaIndiansDiabetes
# add the super doctor's opinion to the data
set.seed(2323)
data <- data %>%
  rowwise() %>%
  mutate(superdoc = case_when(diabetes == "pos" ~ as.numeric(sample(0:2, 1)), TRUE ~ 0))
# separate the data into a training set and a test set
train.data <- data[1:550, ]
test.data <- data[551:768, ]
Created on 2021-03-14 by the reprex package (v1.0.0)
What I already tried
# train the model (without the superdoc's opinion)
set.seed(2323)
model <- train(
  diabetes ~ ., data = train.data %>% select(-superdoc), method = "glmnet",
  trControl = trainControl("cv",
                           number = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneLength = 10,
  metric = "ROC" # the ROC metric is provided by twoClassSummary
)
# extract the coefficients for the best alpha and lambda
coef(model$finalModel, model$finalModel$lambdaOpt) -> coeffs
tidy(coeffs) %>% tibble() -> coeffs
coef.interc = coeffs %>% filter(row=="(Intercept)") %>% pull(value)
coef.pregnant = coeffs %>% filter(row=="pregnant") %>% pull(value)
coef.glucose = coeffs %>% filter(row=="glucose") %>% pull(value)
coef.pressure = coeffs %>% filter(row=="pressure") %>% pull(value)
coef.mass = coeffs %>% filter(row=="mass") %>% pull(value)
coef.pedigree = coeffs %>% filter(row=="pedigree") %>% pull(value)
coef.age = coeffs %>% filter(row=="age") %>% pull(value)
# combine the model with the superdoc's opinion in a logistic regression model
finalmodel = glm(diabetes ~ superdoc + I(coef.interc + coef.pregnant*pregnant + coef.glucose*glucose + coef.pressure*pressure + coef.mass*mass + coef.pedigree*pedigree + coef.age*age),family=binomial, data=train.data)
# make predictions on the test data
predict(finalmodel,test.data, type="response") -> predictions
# check the AUC of the model in the test data
roc(test.data$diabetes,predictions, ci=TRUE)
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#>
#> Call:
#> roc.default(response = test.data$diabetes, predictor = predictions, ci = TRUE)
#>
#> Data: predictions in 145 controls (test.data$diabetes neg) < 73 cases (test.data$diabetes pos).
#> Area under the curve: 0.9345
#> 95% CI: 0.8969-0.9721 (DeLong)
Created on 2021-03-14 by the reprex package (v1.0.0)
Where I am not really sure...
I think that to find the most accurate model and to avoid overfitting, I have to use nested cross-validation (as I learned here and here).
However, I am not sure how to do so.
At the moment, every time I use another set.seed, different predictors get selected and I get different AUCs. Can this be mitigated with proper use of nested cross-validation?
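For orientation, nested cross-validation is usually arranged as an inner loop that tunes the hyperparameters and an outer loop that scores the tuned model on data the tuning never saw. The following is only a minimal sketch of that layout, not a recommended final setup: the number of folds, the tuneLength value and the names outer_folds, inner_fit and outer_auc are assumptions made for illustration, and the superdoc column is left out.

```r
library(caret)    # train(), trainControl(), createFolds(); needs glmnet installed
library(mlbench)  # PimaIndiansDiabetes

data(PimaIndiansDiabetes, package = "mlbench")
dat <- PimaIndiansDiabetes

set.seed(2323)
# outer loop: 5 folds, each element holds the row indices that are held out
outer_folds <- createFolds(dat$diabetes, k = 5)

outer_auc <- sapply(outer_folds, function(test_idx) {
  train_part <- dat[-test_idx, ]
  test_part  <- dat[test_idx, ]

  # inner loop: caret tunes alpha and lambda by 5-fold CV,
  # using only the outer training part
  inner_fit <- train(
    diabetes ~ ., data = train_part, method = "glmnet",
    trControl = trainControl("cv", number = 5, classProbs = TRUE,
                             summaryFunction = twoClassSummary),
    tuneLength = 5, metric = "ROC"
  )

  # score the tuned model on the held-out outer fold
  prob_pos <- predict(inner_fit, newdata = test_part, type = "prob")[, "pos"]
  roc_obj <- pROC::roc(test_part$diabetes, prob_pos,
                       levels = c("neg", "pos"), direction = "<", quiet = TRUE)
  as.numeric(pROC::auc(roc_obj))
})

outer_auc        # one AUC per outer fold
mean(outer_auc)  # estimate of how well the whole tuning procedure generalizes
```

The outer AUCs describe the performance of the tuning procedure as a whole; the loop does not by itself produce a single final model.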
Update 1
I just learned that nested CV does not help you get the most accurate model.
The problem is that I get varying coefficients with different set.seed values in the second code sample above. I actually have the same problem as described here: Extract the coefficients for the best tuning parameters of a glmnet model in caret
One posted solution is to use repeated CV to mitigate this variation. Unfortunately, I could not get this to run.
Update 2
Using "repeatedcv" solved my problem. Using repeated CV, not nested CV, did the trick!
model <- train(
  diabetes ~ ., data = train.data %>% select(-superdoc), method = "glmnet",
  trControl = trainControl("repeatedcv",
                           number = 10,
                           repeats = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneLength = 10,
  metric = "ROC" # the ROC metric is provided by twoClassSummary
)
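One way to check whether repeated CV actually reduces the seed dependence is to refit with a few different seeds and compare the selected tuning parameters under both resampling schemes. This is a rough sketch under assumptions: the helper tune_once, the three seeds, repeats = 5 and tuneLength = 5 are illustrative choices that keep the run short, and the raw PimaIndiansDiabetes rows are used without the superdoc column.

```r
library(caret)    # train(), trainControl(); needs glmnet installed
library(mlbench)  # PimaIndiansDiabetes

data(PimaIndiansDiabetes, package = "mlbench")
train_part <- PimaIndiansDiabetes[1:550, ]

# hypothetical helper: fit once for a given seed and resampling scheme
# and return the tuning parameters that caret selected
tune_once <- function(seed, ctrl) {
  set.seed(seed)
  fit <- train(diabetes ~ ., data = train_part, method = "glmnet",
               trControl = ctrl, tuneLength = 5, metric = "ROC")
  data.frame(seed = seed, fit$bestTune)
}

ctrl_cv <- trainControl("cv", number = 10, classProbs = TRUE,
                        summaryFunction = twoClassSummary)
ctrl_rep <- trainControl("repeatedcv", number = 10, repeats = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)

seeds <- c(1, 2, 3)
single_cv   <- do.call(rbind, lapply(seeds, tune_once, ctrl = ctrl_cv))
repeated_cv <- do.call(rbind, lapply(seeds, tune_once, ctrl = ctrl_rep))

single_cv    # alpha and lambda chosen per seed with plain 10-fold CV
repeated_cv  # alpha and lambda chosen per seed with repeated 10-fold CV
```

If the alpha and lambda columns differ less between seeds in repeated_cv than in single_cv, the repeated scheme is doing what the update describes; with only three seeds this is an indication, not a proof.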
Thanks to @missuse I could resolve my questions:
Cross-validation does not help to get the most accurate model. This misunderstanding (which was mine) is discussed beautifully in the blog post: The "Cross-Validation - Train/Predict" misunderstanding
The seed-dependent variation of the glmnet predictor coefficients in small datasets can be mitigated with repeated cross-validation (i.e., "repeatedcv" in caret::trainControl, as described in the comments here).
Stacked learners (in my case a stacked glmnet and glm) are usually built using out-of-fold predictions from the lower-level learners. This could be done using the mlr3 package as described in this blog post: Tuning a stacked learner. Since this was not part of the initial question, I opened a new question here.
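The out-of-fold idea from the last point can also be sketched by hand, without mlr3. The code below is an illustration under stated assumptions, not the mlr3 approach from the linked post: it calls cv.glmnet directly instead of caret, fixes alpha at 0.5 for brevity, uses 5 folds, and re-creates superdoc with a simplified vectorised draw, so the numbers will not match the output shown earlier. The names oof_score, meta_fit and stack_test are made up for the example.

```r
library(glmnet)   # cv.glmnet()
library(mlbench)  # PimaIndiansDiabetes

data(PimaIndiansDiabetes, package = "mlbench")
dat <- PimaIndiansDiabetes

# simplified re-creation of the simulated expert opinion
# (assumption: not the same random draws as the rowwise() code above)
set.seed(2323)
dat$superdoc <- ifelse(dat$diabetes == "pos",
                       sample(0:2, nrow(dat), replace = TRUE), 0)

train_part <- dat[1:550, ]
test_part  <- dat[551:768, ]
predictors <- setdiff(names(dat), c("diabetes", "superdoc"))
x_train <- as.matrix(train_part[, predictors])
x_test  <- as.matrix(test_part[, predictors])

# level 0: out-of-fold linear predictor from glmnet
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(train_part)))
oof_score <- numeric(nrow(train_part))
for (i in seq_len(k)) {
  held_out <- fold_id == i
  # lambda is tuned inside cv.glmnet using only the rows not held out
  fit_i <- cv.glmnet(x_train[!held_out, ], train_part$diabetes[!held_out],
                     family = "binomial", alpha = 0.5)
  oof_score[held_out] <- as.numeric(
    predict(fit_i, newx = x_train[held_out, ], s = "lambda.min", type = "link")
  )
}

# level 1: logistic regression on the out-of-fold score plus the expert opinion
stack_train <- data.frame(diabetes = train_part$diabetes,
                          superdoc = train_part$superdoc,
                          glmnet_score = oof_score)
meta_fit <- glm(diabetes ~ superdoc + glmnet_score,
                family = binomial, data = stack_train)

# for new data, the level-0 model is refit on the full training set
full_fit <- cv.glmnet(x_train, train_part$diabetes,
                      family = "binomial", alpha = 0.5)
stack_test <- data.frame(
  superdoc = test_part$superdoc,
  glmnet_score = as.numeric(
    predict(full_fit, newx = x_test, s = "lambda.min", type = "link")
  )
)
test_prob <- predict(meta_fit, newdata = stack_test, type = "response")
```

Because every training row gets its glmnet score from a model that did not see that row, the second-level glm is not fitted on scores that are optimistically biased. Note that glm may warn about fitted probabilities of 0 or 1 here, since the simulated superdoc is non-zero only for positive cases.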