Custom Performance Function in caret Package Using Predicted Probability
Question
This SO post is about using a custom performance measurement function in the caret package. You want to find the best prediction model, so you build several and compare them by calculating a single metric derived from comparing the observed and predicted values. There are default functions to calculate this metric, but you can also define your own metric function. This custom function must take the observed and predicted values as input.
In classification problems (let's say with only two classes) the predicted value is 0 or 1. However, what I also need to evaluate is the probability calculated by the model. Is there any way to achieve this?
The reason is that there are applications where you need to know whether a 1 prediction was made with 99% probability or with 51% probability, not just whether the prediction is 1 or 0.
Can anyone help?
Edit
OK, so let me try to explain a little better. Section 5.5.5 (Alternate Performance Metrics) of the caret package documentation describes how to use your own custom performance function, like so:
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)
twoClassSummary is the performance function supplied in this example. The function provided here needs to take as input a data frame or matrix with obs and pred columns. And here's the point: I want to use a function that takes not the observed and predicted classes, but the observed class and the predicted probability.
One more thing: solutions from other packages are also welcome. The only thing I am not looking for is "This is how you write your own cross-validation function."
Recommended answer
Caret does support passing class probabilities to custom summary functions when you specify classProbs = TRUE in trainControl. In that case, the data argument passed to the custom summary function will have two additional columns, named after the classes, containing the probability of each class. The names of these classes are in the lev argument, which is a vector of length 2.
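To make that layout concrete, here is a minimal base-R sketch that does not need caret to run. The data frame `toy` is a hand-built stand-in for what the summary function receives (it is not actual caret output), and `BrierSummary` is a hypothetical name. The Brier score is used only as an illustration of a probability-based metric; it does not come from the original answer.

```r
# Hypothetical stand-in for the `data` argument when classProbs = TRUE:
# factor columns `obs` and `pred`, plus one numeric column per class level.
lev <- c("M", "R")
toy <- data.frame(
  obs  = factor(c("M", "R", "R", "M"), levels = lev),
  pred = factor(c("M", "R", "M", "M"), levels = lev),
  M    = c(0.9, 0.2, 0.6, 0.7),
  R    = c(0.1, 0.8, 0.4, 0.3)
)

# Illustrative summary function (name and metric are assumptions for this sketch).
# It has the (data, lev, model) signature that a summaryFunction is called with.
BrierSummary <- function(data, lev = NULL, model = NULL) {
  probs <- data[, lev[2]]                  # predicted probability of the second class
  real  <- as.numeric(data$obs == lev[2])  # 1 when the observed class is the second level
  out   <- mean((probs - real)^2)          # mean squared difference (Brier score)
  names(out) <- "Brier"                    # the name is what `metric` would refer to
  out
}

brier <- BrierSummary(toy, lev = lev)
print(brier)
```

A function of this shape would be handed to trainControl() as summaryFunction together with classProbs = TRUE, and train() would then be called with metric = "Brier" and maximize = FALSE, since lower is better for this score.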
A complete example with caret:
library(caret)
library(mlbench)
data(Sonar)
Custom summary LogLoss:
LogLoss <- function(data, lev = NULL, model = NULL) {
  obs <- data[, "obs"]     # truth
  cls <- levels(obs)       # find class names
  probs <- data[, cls[2]]  # use the second class name to extract the probabilities of the second class
  # bound the probabilities away from 0 and 1; this line and those below are
  # just the log loss calculation and are not essential to the question
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15)
  logPreds <- log(probs)
  log1Preds <- log(1 - probs)
  real <- (as.numeric(data$obs) - 1)  # 0 for the first class, 1 for the second
  out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
  # important: this name is what `metric` refers to in the call to train().
  # The output can be a named vector of multiple values.
  names(out) <- c("LogLoss")
  out
}
fitControl <- trainControl(method = "cv",
                           number = 5,
                           classProbs = TRUE,
                           summaryFunction = LogLoss)
fit <- train(Class ~ .,
             data = Sonar,
             method = "rpart",
             metric = "LogLoss",
             tuneLength = 5,
             trControl = fitControl,
             maximize = FALSE) # important, depends on the calculated performance measure
fit
#output
CART

208 samples
 60 predictor
  2 classes: 'M', 'R'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 166, 166, 166, 167, 167
Resampling results across tuning parameters:

  cp          LogLoss
  0.00000000  1.1220902
  0.01030928  1.1220902
  0.05154639  1.1017268
  0.06701031  1.0694052
  0.48453608  0.6405134

LogLoss was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.4845361.
Alternatively, use the lev argument, which contains the class levels, and add some error checking:
LogLoss <- function(data, lev = NULL, model = NULL) {
  if (length(lev) > 2) {
    stop(paste("Your outcome has", length(lev), "levels. The LogLoss() function isn't appropriate."))
  }
  obs <- data[, "obs"]     # truth
  probs <- data[, lev[2]]  # use second class name
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15)  # bound probability
  logPreds <- log(probs)
  log1Preds <- log(1 - probs)
  real <- (as.numeric(data$obs) - 1)
  out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
  names(out) <- c("LogLoss")
  out
}
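One way to sanity-check a summary function like this before handing it to train() is to call it directly on a small hand-built data frame. The sketch below is only an illustration: the toy data, the three-level vector `lev3`, and the compact restatement of LogLoss() (repeated so the block runs on its own in base R) are assumptions made for the example.

```r
# Compact restatement of the LogLoss() summary function, so this block is self-contained
LogLoss <- function(data, lev = NULL, model = NULL) {
  if (length(lev) > 2) {
    stop(paste("Your outcome has", length(lev), "levels. The LogLoss() function isn't appropriate."))
  }
  probs <- pmax(pmin(as.numeric(data[, lev[2]]), 1 - 1e-15), 1e-15)  # bound probability
  real  <- as.numeric(data$obs) - 1                                  # 0 = first level, 1 = second
  out   <- -mean(real * log(probs) + (1 - real) * log(1 - probs))
  names(out) <- "LogLoss"
  out
}

lev <- c("M", "R")
# Hypothetical held-out predictions: every probability is 0.5 (an uninformative model)
toy <- data.frame(
  obs  = factor(c("M", "R", "R", "M"), levels = lev),
  pred = factor(c("M", "M", "M", "M"), levels = lev),
  M    = rep(0.5, 4),
  R    = rep(0.5, 4)
)

# With constant 0.5 probabilities every term is log(0.5), so the log loss is log(2)
ll <- LogLoss(toy, lev = lev)
print(ll)

# Passing three levels should trigger the error check instead of returning a value
lev3 <- c("A", "B", "C")
err <- tryCatch(LogLoss(toy, lev = lev3), error = function(e) conditionMessage(e))
print(err)
```

The constant-probability case is a convenient reference point: a cross-validated LogLoss well above log(2), about 0.693, suggests the model's probabilities are worse than always guessing 0.5.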
For additional info, check out this section of the caret book: https://topepo.github.io/caret/model-training-and-tuning.html#metrics
It is a great book to read if you plan on using caret, and a good read even if you do not.