Using caret to optimize for deviance with binary classification
Question
I have this example (borrowed from another source):
library("AppliedPredictiveModeling")
library("caret")
data("AlzheimerDisease")
data <- data.frame(predictors, diagnosis)
tuneGrid <- expand.grid(interaction.depth = 1:2, n.trees = 100, shrinkage = 0.1)
trainControl <- trainControl(method = "cv", number = 5, verboseIter = TRUE)
gbmFit <- train(diagnosis ~ ., data = data, method = "gbm", trControl = trainControl, tuneGrid = tuneGrid)
But let's say I want to optimize with regard to deviance (which is what I believe gbm returns by default) instead of accuracy. I know that trainControl offers a summaryFunction argument. How do I write a summaryFunction that will optimize for deviance?
Recommended answer
Deviance is just (minus) twice the log-likelihood. For binomial data with a single trial, that is:
-2 \sum_{i=1}^n \left[ y_i \log(\pi_i) + (1 - y_i) \log(1 - \pi_i) \right]
where y_i is a binary indicator for the first class and \pi_i is the probability of being in the first class.
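As a quick illustration of the formula, here is a minimal base-R sketch. The helper name `binom_deviance` and the toy vectors are hypothetical and not part of the original answer.

```r
# Hypothetical helper: binomial deviance for single-trial (0/1) outcomes.
# y is the 0/1 indicator for the first class, p the predicted probability
# of the first class.
binom_deviance <- function(y, p) {
  -2 * sum(y * log(p) + (1 - y) * log(1 - p))
}

# Toy data, made up for illustration only
y     <- c(1, 0, 1, 1, 0)
p_hat <- c(0.9, 0.2, 0.6, 0.8, 0.3)

dev_model   <- binom_deviance(y, p_hat)
# An uninformative prediction of 0.5 everywhere gives 2 * n * log(2)
dev_uniform <- binom_deviance(y, rep(0.5, length(y)))

dev_model < dev_uniform
```

Lower values mean the predicted probabilities agree better with the observed classes, which is why deviance is a quantity to minimize.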
Here is a simple example that reproduces the deviance reported by a GLM (by re-calculating the training set deviance):
> library(caret)
> set.seed(1)
> dat <-twoClassSim(200)
> fit1 <- glm(Class ~ ., data = dat, family = binomial)
> ## glm() models the last class level
> prob_class1 <- 1 - predict(fit1, dat[, -ncol(dat)], type = "response")
> is_class1 <- ifelse(dat$Class == "Class1", 1, 0)
> -2*sum(is_class1*log(prob_class1) + ((1-is_class1)*log(1-prob_class1)))
[1] 112.7706
> fit1
Call: glm(formula = Class ~ ., family = binomial, data = dat)
<snip>
Degrees of Freedom: 199 Total (i.e. Null); 184 Residual
Null Deviance: 275.3
Residual Deviance: 112.8 AIC: 144.8
A basic summary function for use with train is:
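dev_summary <- function(data, lev = NULL, model = NULL) {
  # data$obs holds the observed classes; data[, lev[1]] holds the predicted
  # probability of the first class (available because classProbs = TRUE)
  is_class1 <- ifelse(data$obs == lev[1], 1, 0)
  prob_class1 <- data[, lev[1]]
  c(deviance = -2 * sum(is_class1 * log(prob_class1) +
                          (1 - is_class1) * log(1 - prob_class1)),
    twoClassSummary(data, lev = lev))
}

ctrl <- trainControl(summaryFunction = dev_summary,
                     classProbs = TRUE)

# Note: recent caret releases also expect an n.minobsinnode column
# in the gbm tuning grid (e.g. n.minobsinnode = 10)
gbm_grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                        n.trees = seq(100, 1000, by = 50),
                        shrinkage = c(0.01, 0.1))

set.seed(1)
fit2 <- train(Class ~ ., data = dat,
              method = "gbm",
              trControl = ctrl,
              tuneGrid = gbm_grid,
              metric = "deviance",
              # Lower deviance is better; train() maximizes custom metrics by default
              maximize = FALSE,
              verbose = FALSE)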
Note that you will need to decide how to handle cases where \pi_i is very near zero or one: log(0) is -Inf in R, so a single such prediction can make the deviance infinite or NaN.
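One common way to handle this, sketched below, is to clamp the probabilities away from exactly 0 and 1 before taking logs. This is an assumption rather than something the answer prescribes: the function name `dev_summary_clipped`, the extra `eps` argument, and its default of 1e-15 are all hypothetical choices. The call to twoClassSummary is left out so that the sketch runs in base R.

```r
# Hypothetical variant of a deviance summary function that clamps the
# predicted probabilities into [eps, 1 - eps] so that log() stays finite.
dev_summary_clipped <- function(data, lev = NULL, model = NULL, eps = 1e-15) {
  is_class1 <- ifelse(data$obs == lev[1], 1, 0)
  prob_class1 <- pmin(pmax(data[, lev[1]], eps), 1 - eps)
  c(deviance = -2 * sum(is_class1 * log(prob_class1) +
                          (1 - is_class1) * log(1 - prob_class1)))
}

# Made-up held-out predictions in the layout a summary function receives:
# an obs column plus one probability column per class level
lev <- c("Class1", "Class2")
toy <- data.frame(obs = factor(c("Class1", "Class2", "Class1"), levels = lev),
                  Class1 = c(1, 0, 0),
                  Class2 = c(0, 1, 1))

# The third row is a confident wrong prediction; without clamping the
# result would not be finite
clipped <- dev_summary_clipped(toy, lev = lev)
clipped
```

The choice of eps sets how heavily a confidently wrong prediction is penalized, so it is worth picking deliberately rather than treating it as a purely numerical detail.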
Max