How to change the cost matrix in R with caret and C5.0Cost?


Problem description

I'm currently experimenting with caret and C5.0Cost in R. So far I have a base model that is working fine. But the tuning parameters give me some headaches.

I seem to be unable to change the cost for the false positives.

library(mlbench)
data(Sonar)

library(caret)

set.seed(990)
inTraining <- createDataPartition(Sonar$Class, p = .5, list = FALSE)
inTraining
training <- Sonar[inTraining,]
test <- Sonar[-inTraining,]

set.seed(990)
fitControl <- trainControl(method="repeatedcv", number=10, repeats=5)
statGrid <-  expand.grid(trials = 1,
                     model = "tree",
                     winnow = FALSE,
                     cost = matrix(c(
                         0, 2,
                         1, 0
                     ), 2, 2, byrow=TRUE))

set.seed(825)
statFit <- train(Class~., data=training, method="C5.0Cost", trControl=fitControl, tuneGrid = statGrid, metric = "Accuracy")

statFit["finalModel"]

write(capture.output(summary(statFit)), "c50model.txt")

R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] DMwR_0.4.1 plyr_1.8.3 C50_0.1.0-24 caret_6.0-52 ggplot2_1.0.1 lattice_0.20-31
[7] mlbench_2.1-1

loaded via a namespace (and not attached):
[1] Rcpp_0.11.6 compiler_3.2.1 nloptr_1.0.4 bitops_1.0-6
[5] xts_0.9-7 class_7.3-12 iterators_1.0.7 tools_3.2.1
[9] rpart_4.1-9 partykit_1.0-3 digest_0.6.8 lme4_1.1-8
[13] nlme_3.1-120 gtable_0.1.2 mgcv_1.8-6 Matrix_1.2-1
[17] foreach_1.4.2 parallel_3.2.1 brglm_0.5-9 SparseM_1.6
[21] proto_0.3-10 e1071_1.6-7 BradleyTerry2_1.0-6 stringr_1.0.0
[25] caTools_1.17.1 gtools_3.5.0 stats4_3.2.1 nnet_7.3-9
[29] survival_2.38-1 gdata_2.17.0 minqa_1.2.4 ROCR_1.0-7
[33] TTR_0.23-0 reshape2_1.4.1 car_2.0-26 magrittr_1.5
[37] gplots_2.17.0 scales_0.2.5 codetools_0.2-11 MASS_7.3-40
[41] splines_3.2.1 quantmod_0.4-5 abind_1.4-3 pbkrtest_0.4-2
[45] colorspace_1.2-6 quantreg_5.11 KernSmooth_2.23-14 stringi_0.5-5
[49] munsell_0.4.2 zoo_1.7-12

The only change that caret (?) accepts is a change to the false negatives (the entry in the example above that is set to two). All other changes are ignored, unfortunately. One can easily confirm this by typing statFit["finalModel"] into the R console.

Solution

@JimBoy I was running into the same issue as you. I took a look at the source code on GitHub for the caret wrapper for "C5.0Cost": one of the off-diagonal entries of the cost matrix is hard-coded to 1 in the code (see the cmat object), so only the other entry can be tuned.
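A small base-R sketch of what seems to be going on. The first part only uses expand.grid, so it can be checked directly. The `stock_cmat` helper is hypothetical: it mimics how I recall the stock wrapper building its matrix, and should be compared against the current caret source rather than taken as its exact code.

```r
# Base R only; neither caret nor C50 is needed for this check.
cost_matrix <- matrix(c(0, 2,
                        1, 0), 2, 2, byrow = TRUE)

# expand.grid() does not keep the matrix shape: the four entries become
# four candidate values of a single `cost` column (column-major order).
grid <- expand.grid(trials = 1, model = "tree", winnow = FALSE,
                    cost = cost_matrix)
print(grid)

# Hypothetical stand-in for the wrapper's matrix construction (assumption,
# recalled from the caret source): one off-diagonal cell is fixed at 1 and
# only the other one follows the `cost` tuning parameter.
stock_cmat <- function(cost) matrix(c(0, cost, 1, 0), ncol = 2)
print(stock_cmat(2))
```

So a 2 x 2 matrix passed through the tuning grid is read as four separate values of one scalar parameter, which would explain why only one cell ever appears to change.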

I modified the cost input in modelInfo so that you can set costs for both false positives and false negatives. Instead of one cost parameter, you now specify two in expand.grid: false positives (costFP) and false negatives (costFN), each a vector of the costs you want to assess.
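To make the mapping concrete, here is a minimal sketch of how the two parameters end up in a 2 x 2 matrix. `make_cost_matrix` is a hypothetical helper written for illustration only, and "M" and "R" are the two class levels of the Sonar data. Which cell counts as a false positive depends on the row/column convention of your C50 version and on which class you treat as positive, so check ?C5.0 before relying on the names.

```r
# Hypothetical helper (not part of caret or C50): builds the same layout
# as the cmat object in the fit function of the modified modelInfo.
make_cost_matrix <- function(costFP, costFN, class_levels) {
  stopifnot(length(class_levels) == 2)
  # Filled column by column: costFP lands in cell [2, 1], costFN in [1, 2]
  cmat <- matrix(c(0, costFP, costFN, 0), ncol = 2)
  dimnames(cmat) <- list(class_levels, class_levels)
  cmat
}

cmat <- make_cost_matrix(costFP = 3, costFN = 1, class_levels = c("M", "R"))
print(cmat)
```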

modelInfo <- list(label = "Cost-Sensitive C5.0",
            library = c("C50", "plyr"),
            loop = function(grid) {
              # Fit one model per (model, winnow, costFP, costFN) combination,
              # using the largest number of boosting trials in that group
              loop <- plyr::ddply(grid, c("model", "winnow", "costFP", "costFN"),
                                  function(x) c(trials = max(x$trials)))

              submodels <- vector(mode = "list", length = nrow(loop))
              for(i in seq(along = loop$trials))
              {
                # All four conditions must be joined with `&` and compared
                # against whole grid columns; extra arguments to which()
                # are not treated as conditions
                index <- which(grid$model == loop$model[i] &
                                 grid$winnow == loop$winnow[i] &
                                 grid$costFP == loop$costFP[i] &
                                 grid$costFN == loop$costFN[i])
                trials <- grid[index, "trials"]
                submodels[[i]] <- data.frame(trials = trials[trials != loop$trials[i]])
              }
              list(loop = loop, submodels = submodels)
            },
            type = "Classification",
            parameters = data.frame(parameter = c('trials', 'model', 'winnow', "costFP","costFN"),
                                    class = c("numeric", "character", "logical", "numeric","numeric"),
                                    label = c('# Boosting Iterations', 'Model Type', 'Winnow', "CostFP","CostFN")),
            grid = function(x, y, len = NULL, search = "grid") {
              if(search == "grid") {
                c5seq <- if(len == 1)  1 else  c(1, 10*((2:min(len, 11)) - 1))
                out <- expand.grid(trials = c5seq, model = c("tree", "rules"),
                                   winnow = c(TRUE, FALSE), costFP = 1:2, costFN = 1:2)
              } else {
                out <- data.frame(trials = sample(1:100, replace = TRUE, size = len),
                                  model = sample(c("tree", "rules"), replace = TRUE, size = len),
                                  winnow = sample(c(TRUE, FALSE), replace = TRUE, size = len),
                                  costFP = runif(len, min = 1, max = 20),
                                  costFN = runif(len, min = 1, max = 20))
              }
              out
            },
            fit = function(x, y, wts, param, lev, last, classProbs, ...) {
              theDots <- list(...)

              # Functions are called with the package prefix so that this also
              # works when caret loads C50 without attaching it
              if(any(names(theDots) == "control"))
              {
                theDots$control$winnow <- param$winnow
              } else theDots$control <- C50::C5.0Control(winnow = param$winnow)

              argList <- list(x = x, y = y, weights = wts, trials = param$trials,
                              rules = param$model == "rules")

              # Both off-diagonal cells now come from the tuning grid
              cmat <- matrix(c(0, param$costFP, param$costFN, 0), ncol = 2)
              rownames(cmat) <- colnames(cmat) <- levels(y)
              # A user-supplied `costs` argument is overridden by the tuning parameters
              if(any(names(theDots) == "costs")){
                warning("For 'C5.0Cost', the costs are a tuning parameter")
                theDots$costs <- cmat
              } else argList$costs <- cmat

              argList <- c(argList, theDots)
              # Call the generic; it dispatches to the default (x, y) method
              do.call(C50::C5.0, argList)
            },
            predict = function(modelFit, newdata, submodels = NULL) {
              out <- predict(modelFit, newdata)

              if(!is.null(submodels))
              {
                tmp <- out
                out <- vector(mode = "list", length = nrow(submodels) + 1)
                out[[1]] <- tmp

                for(j in seq(along = submodels$trials))
                  out[[j+1]] <- as.character(predict(modelFit, newdata, trials = submodels$trials[j]))
              }
              out
            },
            prob = NULL,
            predictors = function(x, ...) {
              vars <- C50::C5imp(x, metric = "splits")
              rownames(vars)[vars$Overall > 0]
            },
            levels = function(x) x$obsLevels,
            varImp = function(object, ...) C50::C5imp(object, ...),
            tags = c("Tree-Based Model", "Rule-Based Model", "Implicit Feature Selection",
                     "Boosting", "Ensemble Model", "Cost Sensitive Learning", "Two Class Only",
                     "Handle Missing Predictor Data", "Accepts Case Weights"),
            sort = function(x){
              x$model <- factor(as.character(x$model), levels = c("rules", "tree"))
              x[order(x$trials, x$model, !x$winnow, x$costFP, x$costFN),]
            },
            trim = function(x) {
              x$boostResults <- NULL
              x$size <- NULL
              x$call <- NULL
              x$output <- NULL
              x
            })
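The loop element is the least obvious part, so here is a base-R sketch of what it is meant to compute: one fit per combination of the non-boosting parameters at the largest number of trials, with the smaller trial counts listed as submodels to be predicted from that same fit. It uses aggregate in place of plyr::ddply purely so that it runs without extra packages, and the toy grid is made up for illustration.

```r
# Toy tuning grid (hypothetical values): 3 trial counts x 2 costFP values
grid <- expand.grid(trials = c(1, 10, 20), model = "tree", winnow = FALSE,
                    costFP = c(1, 2), costFN = 1, stringsAsFactors = FALSE)

# One row per unique (model, winnow, costFP, costFN), keeping the max trials
loop <- aggregate(trials ~ model + winnow + costFP + costFN,
                  data = grid, FUN = max)

# For each row of `loop`, collect the smaller trial counts that share the
# same values of all four non-boosting parameters
submodels <- lapply(seq_len(nrow(loop)), function(i) {
  same <- grid$model == loop$model[i] &
    grid$winnow == loop$winnow[i] &
    grid$costFP == loop$costFP[i] &
    grid$costFN == loop$costFN[i]
  tr <- grid$trials[same]
  data.frame(trials = tr[tr != loop$trials[i]])
})

print(loop)
print(submodels)
```

With this grid only two models need to be fitted (one per costFP value, each with 20 trials), and the predictions for 1 and 10 trials are derived from them.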

The example you provided above can then be run as follows:

## Example provided
library(mlbench)
data(Sonar)

library(caret)

set.seed(990)
inTraining <- createDataPartition(Sonar$Class, p = .5, list = FALSE)
inTraining
training <- Sonar[inTraining,]
test <- Sonar[-inTraining,]

set.seed(990)
fitControl <- trainControl(method="repeatedcv", number=10, repeats=5)

statGrid <-  expand.grid(trials = 3,
                         model = "tree",
                         winnow = FALSE,
                         cost = 2)

set.seed(825)
statFit <- train(Class~., data=training, method="C5.0Cost", trControl=fitControl, tuneGrid = statGrid, metric = "Accuracy")

## Example modified to include costs for both false positives and negatives
set.seed(825)
## expand.grid crosses the two cost vectors, so 3 x 3 = 9 combinations are evaluated
statGridMod <-  expand.grid(trials = 3,
                            model = "tree",
                            winnow = FALSE,
                            costFP = c(1,2,3), #new cost parameters
                            costFN = c(3,2,1)) #new cost parameters

## Pass the modelInfo list itself as the method, not a method name
statFit <- train(Class~., data=training, method=modelInfo, trControl=fitControl, tuneGrid = statGridMod, metric = "Accuracy")

statFit
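Since the example still tunes on Accuracy, the costs shape each fitted tree but are not reflected in the metric that picks the winner. One way to compare fits by what the errors would actually cost is to weight a confusion table by the cost matrix. The sketch below uses made-up predictions and assumes rows are predicted classes and columns are actual classes; that layout is my assumption for this illustration, not something taken from caret or C50.

```r
# Hypothetical held-out labels and predictions, for illustration only
actual    <- factor(c("M", "M", "R", "R", "R", "M"), levels = c("M", "R"))
predicted <- factor(c("M", "R", "R", "M", "R", "M"), levels = c("M", "R"))

# Confusion table: rows = predicted class, columns = actual class
conf <- table(predicted = predicted, actual = actual)

# Cost matrix in the same layout (assumed): predicting "R" for an actual
# "M" costs 3, predicting "M" for an actual "R" costs 1
costs <- matrix(c(0, 3, 1, 0), ncol = 2,
                dimnames = list(predicted = c("M", "R"),
                                actual = c("M", "R")))

# Cell-by-cell product, summed over the table
total_cost <- sum(conf * costs)
print(conf)
print(total_cost)
```

In a real run, `predicted` would come from predict(statFit, test) and `actual` from test$Class.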
