Custom Performance Function in caret Package using predicted Probability

Question

This SO post is about using a custom performance measurement function in the caret package. You want to find the best prediction model, so you build several and compare them by calculating a single metric derived from comparing the observed and predicted values. There are default functions to calculate this metric, but you can also define your own metric function. This custom function must take obs and predicted values as input.

In classification problems (let's say with only two classes) the predicted value is 0 or 1. However, what I also need to evaluate is the probability calculated by the model. Is there any way to achieve this?

The reason is that there are applications where you need to know whether a prediction of 1 was made with 99% probability or with 51% probability - not just whether the prediction is 1 or 0.

Can anyone help?

Edit: OK, so let me try to explain a little better. The caret documentation, under section 5.5.5 (Alternate Performance Metrics), describes how to use your own custom performance function, like so:

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using 
                           ## the following function
                           summaryFunction = twoClassSummary)

twoClassSummary is the custom performance function in this example. The function provided here needs to take as input a data frame or matrix with obs and pred. And here's the point: I want to use a function that takes not the observed and predicted classes, but the observed class and the predicted probability.

One more thing:

Solutions from other packages are also welcome. The only thing I am not looking for is "This is how you write your own cross-validation function."

Recommended answer

caret does support passing class probabilities to custom summary functions when you specify classProbs = TRUE in trainControl. In that case, the data argument of the custom summary function has two additional columns, named after the classes, containing the predicted probability of each class. The class names are also available in the lev argument, which for a two-class problem is a vector of length 2.
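To make that layout concrete, here is a minimal sketch in base R that builds a hand-made stand-in for the data argument and passes it to a small summary function. The toy data frame and the MeanProbTrue function are hypothetical illustrations, not part of caret, and the frame caret actually passes may contain further columns besides obs, pred, and the class-probability columns.

```r
# Hypothetical stand-in for the `data` argument that a summary function
# receives when classProbs = TRUE (class names "M" and "R", as in Sonar).
lev <- c("M", "R")
toy <- data.frame(
  obs  = factor(c("M", "R", "R", "M"), levels = lev),  # observed class
  pred = factor(c("M", "R", "M", "M"), levels = lev),  # predicted class
  M    = c(0.90, 0.20, 0.55, 0.70),                    # predicted P(class == "M")
  R    = c(0.10, 0.80, 0.45, 0.30)                     # predicted P(class == "R")
)

# Illustrative summary function (an assumption for this sketch): the mean
# predicted probability assigned to the class that was actually observed.
MeanProbTrue <- function(data, lev = NULL, model = NULL) {
  prob_mat <- as.matrix(data[, lev, drop = FALSE])      # one column per class
  # (row, column) index pairs selecting the observed class in each row
  idx <- cbind(seq_len(nrow(data)), match(as.character(data$obs), lev))
  out <- mean(prob_mat[idx])
  names(out) <- "MeanProbTrue"                          # name used as `metric`
  out
}

res <- MeanProbTrue(toy, lev = lev)
print(res)
```

Calling the function by hand on such a toy frame is a cheap way to check a custom metric before handing it to trainControl.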

See the example:

library(caret)
library(mlbench)
data(Sonar)

Custom summary LogLoss:

LogLoss <- function(data, lev = NULL, model = NULL) {
  obs <- data[, "obs"]     # truth
  cls <- levels(obs)       # find class names
  probs <- data[, cls[2]]  # use second class name to extract probs for 2nd class
  # bound probabilities away from 0 and 1; this line and those below are just
  # the log loss calculation, irrelevant to the question
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15)
  logPreds <- log(probs)
  log1Preds <- log(1 - probs)
  real <- (as.numeric(data$obs) - 1)
  out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
  # important: this name is what metric = "LogLoss" refers to in the call to
  # train; the output can be a named vector of multiple values
  names(out) <- c("LogLoss")
  out
}

fitControl <- trainControl(method = "cv",
                           number = 5,
                           classProbs = TRUE,
                           summaryFunction = LogLoss)


fit <-  train(Class ~.,
             data = Sonar,
             method = "rpart", 
             metric = "LogLoss" ,
             tuneLength = 5,
             trControl = fitControl,
             maximize = FALSE) #important, depending on calculated performance measure

fit
#output
CART 

208 samples
 60 predictor
  2 classes: 'M', 'R' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 166, 166, 166, 167, 167 
Resampling results across tuning parameters:

  cp          LogLoss  
  0.00000000  1.1220902
  0.01030928  1.1220902
  0.05154639  1.1017268
  0.06701031  1.0694052
  0.48453608  0.6405134

LogLoss was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.4845361.

Alternatively, use the lev argument, which contains the class levels, and add some error checking:

LogLoss <- function(data, lev = NULL, model = NULL) {
  if (length(lev) > 2) {
    stop(paste("Your outcome has", length(lev), "levels. The LogLoss() function isn't appropriate."))
  }
  obs <- data[, "obs"]     # truth
  probs <- data[, lev[2]]  # use second class name
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15)  # bound probability
  logPreds <- log(probs)
  log1Preds <- log(1 - probs)
  real <- (as.numeric(data$obs) - 1)
  out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
  names(out) <- c("LogLoss")
  out
}
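Since a summary function can return a named vector with several values, the same pattern extends to more than one probability-based metric. The sketch below is an assumption-laden illustration rather than part of the original answer: ProbSummary is a hypothetical name, the Brier score is added only as a second example metric, and the toy data frame mimics the obs and class-probability columns described above.

```r
# Hypothetical two-metric summary function for a two-class problem.
ProbSummary <- function(data, lev = NULL, model = NULL) {
  stopifnot(length(lev) == 2)
  # probability of the second class, bounded away from 0 and 1
  p <- pmax(pmin(as.numeric(data[, lev[2]]), 1 - 1e-15), 1e-15)
  y <- as.numeric(data$obs == lev[2])  # 1 if the second class was observed
  c(LogLoss = -mean(y * log(p) + (1 - y) * log(1 - p)),
    Brier   = mean((p - y)^2))
}

# Hand-made stand-in for the `data` argument (class names as in Sonar).
lev <- c("M", "R")
toy <- data.frame(
  obs = factor(c("M", "R", "R", "M"), levels = lev),
  M   = c(0.90, 0.20, 0.50, 0.75),
  R   = c(0.10, 0.80, 0.50, 0.25)
)

res <- ProbSummary(toy, lev = lev)
print(res)
```

With a function like this passed as summaryFunction, either name could then be given as metric in the call to train. Both are lower-is-better, so maximize = FALSE applies as in the example above.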

For additional info, check out this section of the caret book: https://topepo.github.io/caret/model-training-and-tuning.html#metrics

It is a great book to read if you plan on using caret, and a good read even if you are not.
