Error with caret package - classification v regression


Question

I am an actuarial student preparing for an upcoming predictive analytics exam in December. Part of an exercise is to build a model using boosting with caret and xgbTree. See the code below; the Caravan dataset is from the ISLR package:

library(caret)
library(ggplot2)
set.seed(1000)
data.Caravan <- read.csv(file = "Caravan.csv")


data.Caravan$Purchase <- factor(data.Caravan$Purchase)
levels(data.Caravan$Purchase) <- c("No", "Yes")


data.Caravan.train <- data.Caravan[1:1000, ]
data.Caravan.test <- data.Caravan[1001:nrow(data.Caravan), ]
grid <- expand.grid(max_depth = c(1:7),
                    nrounds = 500,
                    eta =  c(.01, .05, .01),
                    colsample_bytree = c(.5, .8),
                    gamma = 0,
                    min_child_weight = 1,
                    subsample = .6)

control <- trainControl(method = "cv", 
                        number = 4,
                        classProbs = TRUE,
                        sampling = c("up", "down"))
              
caravan.boost <- train(formula = Purchase ~ .,
                       data =  data.Caravan.train, 
                       method = "xgbTree", 
                       metric = "Accuracy",
                       trControl = control, 
                       tuneGrid = grid)

The definitions in expand.grid and trainControl were specified by the problem, but I keep getting an error:

Error: sampling methods are only implemented for classification problems

If I remove the sampling method from trainControl, I get a new error that states "Metric Accuracy not applicable for regression models". If I remove the Accuracy metric, I get errors stating:

"cannot compute class probabilities for regression" and "Error in names(res$trainingData) %in% as.character(form[[2]]) : argument "form" is missing, with no default"

Ultimately, the problem is that caret treats the task as regression rather than classification, even though the target variable is a factor and classProbs is set to TRUE. Can someone explain how to tell caret to run classification and not regression?

Recommended answer

caret::train does not have a formula argument; the formula is passed through the form argument instead. So, for instance, this works:

caravan.boost <- train(form = Purchase ~ .,
                       data =  data.Caravan.train, 
                       method = "xgbTree", 
                       metric = "Accuracy",
                       trControl = control, 
                       tuneGrid = grid)

#output:
eXtreme Gradient Boosting 

1000 samples
  85 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (4 fold) 
Summary of sample sizes: 751, 749, 750, 750 
Addtional sampling using up-sampling

Resampling results across tuning parameters:

  eta   max_depth  colsample_bytree  Accuracy   Kappa     
  0.01  1          0.5               0.7020495  0.10170007
  0.01  1          0.8               0.7100335  0.09732773
  0.01  2          0.5               0.7730581  0.12361444
  0.01  2          0.8               0.7690620  0.11293561
  0.01  3          0.5               0.8330506  0.14461709
  0.01  3          0.8               0.8290146  0.06908344
  0.01  4          0.5               0.8659949  0.07396586
  0.01  4          0.8               0.8749790  0.07451637
  0.01  5          0.5               0.8949792  0.07599005
  0.01  5          0.8               0.8949792  0.07525191
  0.01  6          0.5               0.9079873  0.09766492
  0.01  6          0.8               0.9099793  0.10420720
  0.01  7          0.5               0.9169833  0.11769151
  0.01  7          0.8               0.9119753  0.10873268
  0.05  1          0.5               0.7640699  0.08281792
  0.05  1          0.8               0.7700580  0.09201503
  0.05  2          0.5               0.8709909  0.09034807
  0.05  2          0.8               0.8739990  0.10440898
  0.05  3          0.5               0.9039792  0.12166348
  0.05  3          0.8               0.9089832  0.11850402
  0.05  4          0.5               0.9149793  0.11602447
  0.05  4          0.8               0.9119713  0.11207786
  0.05  5          0.5               0.9139633  0.11853793
  0.05  5          0.8               0.9159754  0.11968085
  0.05  6          0.5               0.9219794  0.11744643
  0.05  6          0.8               0.9199794  0.12803204
  0.05  7          0.5               0.9179873  0.08701058
  0.05  7          0.8               0.9179793  0.10702619

Tuning parameter 'nrounds' was held constant at a value of 500
Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 0.6
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 500, max_depth = 6, eta = 0.05, gamma = 0, colsample_bytree = 0.5, min_child_weight = 1 and subsample = 0.6.
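
If you want to confirm the argument name yourself, one option is to list the arguments of the formula method. This is a minimal sketch, assuming caret is installed and that the method is registered as the S3 method train.formula; the variable names train_formula and arg_names are made up for illustration.

```r
library(caret)

# Look up the S3 method that handles formula input to train().
# Assumption: caret registers it under the generic "train" and class "formula".
train_formula <- getS3method("train", "formula")

# List the argument names of that method.
arg_names <- names(formals(train_formula))
print(arg_names)

# Check which spelling the method accepts.
"form" %in% arg_names
"formula" %in% arg_names
```

An argument name that does not appear in this list cannot be matched by name, which is consistent with the "argument "form" is missing" message quoted in the question.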

You can also use the non-formula interface, in which you specify x and y separately:

caravan.boost <- train(x = data.Caravan.train[,-ncol(data.Caravan.train)],
                       y =  data.Caravan.train$Purchase, 
                       method = "xgbTree", 
                       metric = "Accuracy",
                       trControl = control, 
                       tuneGrid = grid)

Note that these two ways of specifying the model do not always produce the same result when there are factor variables in x, since the formula interface calls model.matrix for most algorithms.
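
To see why the two interfaces can differ, it may help to look at what model.matrix does to a factor column. The sketch below uses a small made-up data frame (toy, age and region are illustrative names, not part of the Caravan data) and only base R.

```r
# Toy data: one numeric predictor and one three-level factor predictor.
toy <- data.frame(
  age    = c(25, 40, 33, 51),
  region = factor(c("north", "south", "west", "south"))
)

# Roughly what the formula interface hands to the model:
# the factor is expanded into dummy columns (intercept dropped here).
x_formula <- model.matrix(~ ., data = toy)[, -1]
colnames(x_formula)

# What the x/y interface receives: the data frame as-is, factor intact.
x_raw <- toy
colnames(x_raw)
```

With the formula route the model sees separate indicator columns for the non-reference levels of region, whereas the x/y route passes region as a single factor column, so algorithms that treat factors natively may split on it differently.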

To get the data:

library(ISLR)
data(Caravan)
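
If you load the data this way rather than from Caravan.csv, the objects used in the question can be rebuilt along these lines. This is a sketch that assumes the ISLR package is installed and that the packaged Caravan data has the same row order as the CSV used in the question, which is not confirmed by the source.

```r
library(ISLR)

# Use the packaged data set in place of read.csv("Caravan.csv").
data.Caravan <- ISLR::Caravan

# Confirm the target is a factor before training, so that caret
# treats the task as classification.
stopifnot(is.factor(data.Caravan$Purchase))
levels(data.Caravan$Purchase)

# Same split as in the question: first 1000 rows for training.
data.Caravan.train <- data.Caravan[1:1000, ]
data.Caravan.test  <- data.Caravan[1001:nrow(data.Caravan), ]
```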
