R: can caret::train function for glmnet cross-validate AUC at fixed alpha and lambda?


Problem description

I would like to use caret::train to cross-validate the AUC at a fixed alpha and lambda.

https://stats.stackexchange.com/questions/69638/does-caret-train-function-for-glmnet-cross-validate-for-both-alpha-and-lambda/69651 explains how to cross-validate alpha and lambda with caret::train.

My question on Cross Validated was closed because it was classified as a programming question: https://stats.stackexchange.com/questions/505865/r-calculate-the-10-fold-crossvalidated-auc-with-glmnet-and-given-alpha-and-lamb?noredirect=1#comment934491_505865

What I have

The dataset:

library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)

# example data
data(PimaIndiansDiabetes, package="mlbench")

# make a training set
set.seed(2323)
train.data <- PimaIndiansDiabetes

My model:

# build a model using the training set
set.seed(2323)
model <- train(
  diabetes ~., data = train.data, method = "glmnet",
  trControl = trainControl("cv",
                           number = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE),
  tuneLength = 10,
  metric="ROC"
)

Here I get the following warning:

Warning message:
In train.default(x, y, weights = w, ...) :
  The metric "ROC" was not in the result set. Accuracy will be used instead.

If I ignore the warning, the best alpha and lambda would be:

model$bestTune
   alpha      lambda
11   0.2 0.002926378

Now I would like to get the 10-fold cross-validated AUC using my model with the best alpha and lambda on the training data.

What I have tried

My approach would be something like this; however, I get the error "Something is wrong; all the Accuracy metric values are missing":

model <- train(
  diabetes ~., data = train.data, method = "glmnet",
  trControl = trainControl("cv",
                           number = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE),
  alpha=model$bestTune$alpha,
  lambda=model$bestTune$lambda,
  tuneLength = 10,
  metric="ROC"
)

How could I calculate the cross-validated AUC using the optimal alpha and lambda and the training data?

I am still not sure how to cross-validate for AUC instead of Accuracy.

Thank you for your help.

Answer

You intend to use "ROC" (the area under the ROC curve) to pick the best tuning parameters, but you did not specify twoClassSummary(), which holds this metric. This is what the warning is telling you:

Warning message:
In train.default(x, y, weights = w, ...) :
  The metric "ROC" was not in the result set. Accuracy will be used instead.

To perform the tuning:

library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)

data(PimaIndiansDiabetes, package="mlbench")

set.seed(2323)
train.data <- PimaIndiansDiabetes

set.seed(2323)
model <- train(
  diabetes ~., data = train.data, method = "glmnet",
  trControl = trainControl("cv",
                           number = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneLength = 10,
  metric="ROC" #ROC metric is in twoClassSummary
)
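This tunes over the whole grid by the cross-validated AUC. If you instead want to cross-validate only at the already selected alpha and lambda, a minimal sketch (my addition, using caret's standard tuneGrid argument; model_fixed is just an illustrative name) is to fix the tuning grid to a single row:

# sketch: resample at one fixed (alpha, lambda) pair instead of a grid
set.seed(2323)
model_fixed <- train(
  diabetes ~., data = train.data, method = "glmnet",
  trControl = trainControl("cv",
                           number = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneGrid = data.frame(alpha = model$bestTune$alpha,
                        lambda = model$bestTune$lambda),
  metric = "ROC"
)
model_fixed$resample$ROC        # per-fold AUC at the fixed alpha and lambda
mean(model_fixed$resample$ROC)  # 10-fold cross-validated AUC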

Since you specified classProbs = TRUE and savePredictions = TRUE, you can calculate any metric from the saved predictions. To calculate accuracy:

model$pred %>%
  filter(alpha == model$bestTune$alpha,   #filter predictions for best tuning parameters
         lambda == model$bestTune$lambda) %>%
  group_by(Resample) %>% #group by fold
  summarise(acc = sum(pred == obs)/n()) #calculate metric
#output
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 10 x 2
   Resample   acc
   <chr>    <dbl>
 1 Fold01   0.740
 2 Fold02   0.753
 3 Fold03   0.818
 4 Fold04   0.776
 5 Fold05   0.779
 6 Fold06   0.753
 7 Fold07   0.766
 8 Fold08   0.792
 9 Fold09   0.727
10 Fold10   0.789

This gives you the metric per fold. To get the average performance:

model$pred %>%
  filter(alpha == model$bestTune$alpha,
         lambda == model$bestTune$lambda) %>%
  group_by(Resample) %>%
  summarise(acc = sum(pred == obs)/n()) %>%
  pull(acc) %>%
  mean
#output
0.769566
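The same pattern works for the AUC itself. Here is a sketch using the pROC package (an assumption on my part; the pos column holds the positive-class probability because classProbs = TRUE):

library(pROC)

model$pred %>%
  filter(alpha == model$bestTune$alpha,
         lambda == model$bestTune$lambda) %>%
  group_by(Resample) %>%                      # group by fold
  summarise(auc = as.numeric(auc(obs, pos,    # response, positive-class probability
                                 levels = c("neg", "pos"),
                                 direction = "<"))) %>%
  pull(auc) %>%
  mean

Since the model was fit with summaryFunction = twoClassSummary and metric = "ROC", the per-fold AUC at the best tune is also stored in model$resample$ROC, so mean(model$resample$ROC) should return the same value.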

When ROC is used as the selection metric, the hyperparameters are optimized over all decision thresholds. In many cases the chosen model will perform suboptimally at the default decision threshold of 0.5.

Caret has a function thresholder() which calculates many metrics on the resampled data over the specified decision thresholds.

thresholder(model, seq(0, 1, length.out = 10)) #in reality I would use length.out = 100

#output

alpha     lambda prob_threshold Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall        F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy  Accuracy
1    0.1 0.03607775      0.0000000       1.000  0.00000000      0.6510595            NaN 0.6510595  1.000 0.7886514  0.6510595      0.6510595            1.0000000         0.5000000 0.6510595
2    0.1 0.03607775      0.1111111       0.994  0.02621083      0.6557464      0.7380952 0.6557464  0.994 0.7901580  0.6510595      0.6471463            0.9869617         0.5101054 0.6562714
3    0.1 0.03607775      0.2222222       0.986  0.15270655      0.6850874      0.8711111 0.6850874  0.986 0.8082906  0.6510595      0.6419344            0.9375256         0.5693533 0.6952837
4    0.1 0.03607775      0.3333333       0.964  0.32421652      0.7278778      0.8406807 0.7278778  0.964 0.8290127  0.6510595      0.6276316            0.8633459         0.6441083 0.7408578
5    0.1 0.03607775      0.4444444       0.928  0.47364672      0.7674158      0.7903159 0.7674158  0.928 0.8395895  0.6510595      0.6041866            0.7877990         0.7008234 0.7695147
6    0.1 0.03607775      0.5555556       0.862  0.59002849      0.7970454      0.7053968 0.7970454  0.862 0.8274687  0.6510595      0.5611928            0.7043575         0.7260142 0.7669686
7    0.1 0.03607775      0.6666667       0.742  0.75740741      0.8521972      0.6114289 0.8521972  0.742 0.7926993  0.6510595      0.4830827            0.5677204         0.7497037 0.7473855
8    0.1 0.03607775      0.7777778       0.536  0.90284900      0.9156149      0.5113452 0.9156149  0.536 0.6739140  0.6510595      0.3489918            0.3828606         0.7194245 0.6640636
9    0.1 0.03607775      0.8888889       0.198  0.98119658      0.9573810      0.3967404 0.9573810  0.198 0.3231917  0.6510595      0.1289474            0.1354751         0.5895983 0.4713602
10   0.1 0.03607775      1.0000000       0.000  1.00000000            NaN      0.3489405       NaN  0.000       NaN  0.6510595      0.0000000            0.0000000         0.5000000 0.3489405
       Kappa          J      Dist
1  0.0000000 0.00000000 1.0000000
2  0.0258717 0.02021083 0.9738516
3  0.1699809 0.13870655 0.8475624
4  0.3337322 0.28821652 0.6774055
5  0.4417759 0.40164672 0.5329805
6  0.4692998 0.45202849 0.4363768
7  0.4727251 0.49940741 0.3580090
8  0.3726156 0.43884900 0.4785352
9  0.1342372 0.17919658 0.8026597
10 0.0000000 0.00000000 1.0000000

Now pick a threshold based on your desired metric and use that. The metrics usually used with imbalanced data are Cohen's kappa, Youden's J, or the Matthews correlation coefficient (MCC). Here is a decent paper on the matter.
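For example, a short sketch for picking the threshold that maximizes Youden's J from the thresholder() output (ths and best_threshold are names I made up):

ths <- thresholder(model, seq(0, 1, length.out = 100))
best_threshold <- ths$prob_threshold[which.max(ths$J)]  # threshold maximizing J
best_threshold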

Please note that since this data was used to find the optimal threshold, the performance obtained this way will be optimistically biased. To evaluate the performance of the picked decision threshold, it would be best to use several independent test sets. In other words, I recommend nested resampling, where you optimize the parameters and the threshold using the inner folds and evaluate on the outer folds.

Here is an explanation of how to use nested resampling with caret for regression. Some modifications are needed to make it work for classification with an optimized threshold.
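A bare-bones sketch of such a nested loop, under my own assumptions (5 outer folds, outer performance measured as AUC with pROC; a threshold search via thresholder() would go inside the inner loop):

set.seed(2323)
outer_folds <- createFolds(PimaIndiansDiabetes$diabetes, k = 5)  # held-out index sets

outer_auc <- sapply(outer_folds, function(idx) {
  inner_data <- PimaIndiansDiabetes[-idx, ]  # outer training set
  outer_test <- PimaIndiansDiabetes[idx, ]   # untouched outer fold
  # inner resampling: tune alpha and lambda (and, with more work, the threshold)
  inner <- train(
    diabetes ~., data = inner_data, method = "glmnet",
    trControl = trainControl("cv", number = 10,
                             classProbs = TRUE,
                             summaryFunction = twoClassSummary),
    tuneLength = 10,
    metric = "ROC"
  )
  # evaluate the inner winner on the outer fold
  probs <- predict(inner, outer_test, type = "prob")[, "pos"]
  as.numeric(pROC::auc(outer_test$diabetes, probs,
                       levels = c("neg", "pos"), direction = "<"))
})
mean(outer_auc)  # nested-CV estimate of performance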

Please note that this is not the only way to pick the best decision threshold. Another way is to pick the desired metric a priori (MCC, for instance) and treat the decision threshold as a hyperparameter to be tuned jointly with all the other hyperparameters. I believe this is not supported in caret without creating custom models.

