Error with caret package - classification v regression
Problem description
I am an actuarial student preparing for an upcoming predictive analytics exam in December. Part of an exercise is to build a model using boosting with caret and xgbTree. See the code below; the Caravan dataset is from the ISLR package:
library(caret)
library(ggplot2)
set.seed(1000)
data.Caravan <- read.csv(file = "Caravan.csv")
data.Caravan$Purchase <- factor(data.Caravan$Purchase)
levels(data.Caravan$Purchase) <- c("No", "Yes")
data.Caravan.train <- data.Caravan[1:1000, ]
data.Caravan.test <- data.Caravan[1001:nrow(data.Caravan), ]
grid <- expand.grid(max_depth = c(1:7),
                    nrounds = 500,
                    eta = c(.01, .05, .01),
                    colsample_bytree = c(.5, .8),
                    gamma = 0,
                    min_child_weight = 1,
                    subsample = .6)
control <- trainControl(method = "cv",
                        number = 4,
                        classProbs = TRUE,
                        sampling = c("up", "down"))
caravan.boost <- train(formula = Purchase ~ .,
                       data = data.Caravan.train,
                       method = "xgbTree",
                       metric = "Accuracy",
                       trControl = control,
                       tuneGrid = grid)
The definitions in expand.grid and trainControl were specified by the problem, but I keep getting an error:
Error: sampling methods are only implemented for classification problems
If I remove the sampling method from trainControl, I get a new error that states "Metric Accuracy not applicable for regression models". If I remove the Accuracy metric, I get two errors: "cannnot compute class probabilities for regression" and "Error in names(res$trainingData) %in% as.character(form[[2]]) : argument "form" is missing, with no default".
Ultimately the problem is that caret is treating the task as regression, not classification, even though the target variable is a factor and classProbs is set to TRUE. Can someone explain how to tell caret to run classification and not regression?
Recommended answer
caret::train does not have a formula argument; the argument in which you specify the formula is called form. That is why the last error above reports that "form" is missing. So, for instance, this works:
caravan.boost <- train(form = Purchase ~ .,
                       data = data.Caravan.train,
                       method = "xgbTree",
                       metric = "Accuracy",
                       trControl = control,
                       tuneGrid = grid)
#output:
eXtreme Gradient Boosting
1000 samples
85 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (4 fold)
Summary of sample sizes: 751, 749, 750, 750
Addtional sampling using up-sampling
Resampling results across tuning parameters:
eta max_depth colsample_bytree Accuracy Kappa
0.01 1 0.5 0.7020495 0.10170007
0.01 1 0.8 0.7100335 0.09732773
0.01 2 0.5 0.7730581 0.12361444
0.01 2 0.8 0.7690620 0.11293561
0.01 3 0.5 0.8330506 0.14461709
0.01 3 0.8 0.8290146 0.06908344
0.01 4 0.5 0.8659949 0.07396586
0.01 4 0.8 0.8749790 0.07451637
0.01 5 0.5 0.8949792 0.07599005
0.01 5 0.8 0.8949792 0.07525191
0.01 6 0.5 0.9079873 0.09766492
0.01 6 0.8 0.9099793 0.10420720
0.01 7 0.5 0.9169833 0.11769151
0.01 7 0.8 0.9119753 0.10873268
0.05 1 0.5 0.7640699 0.08281792
0.05 1 0.8 0.7700580 0.09201503
0.05 2 0.5 0.8709909 0.09034807
0.05 2 0.8 0.8739990 0.10440898
0.05 3 0.5 0.9039792 0.12166348
0.05 3 0.8 0.9089832 0.11850402
0.05 4 0.5 0.9149793 0.11602447
0.05 4 0.8 0.9119713 0.11207786
0.05 5 0.5 0.9139633 0.11853793
0.05 5 0.8 0.9159754 0.11968085
0.05 6 0.5 0.9219794 0.11744643
0.05 6 0.8 0.9199794 0.12803204
0.05 7 0.5 0.9179873 0.08701058
0.05 7 0.8 0.9179793 0.10702619
Tuning parameter 'nrounds' was held constant at a value of 500
Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 0.6
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 500, max_depth = 6, eta = 0.05, gamma = 0, colsample_bytree = 0.5, min_child_weight = 1 and subsample = 0.6.
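Why the misnamed argument goes unnoticed until the "form is missing" error can be sketched in base R. The function toy_fit below is a hypothetical stand-in, not caret's code; it only assumes a signature with a form argument followed by "...". R matches a supplied name to a formal only when the supplied name is identical to, or a prefix of, the formal's name, so formula does not match form and is swallowed by "...":

```r
# Hypothetical stand-in for a function whose formula argument is named `form`.
# It reports whether `form` was supplied and which names ended up in `...`.
toy_fit <- function(form, data, ...) {
  extra <- list(...)
  list(form_missing = missing(form), extra_names = names(extra))
}

d <- data.frame(x = 1:3, y = 1:3)

# Misnamed argument: `formula` is not a prefix of `form`, so it lands in `...`.
wrong <- toy_fit(formula = y ~ x, data = d)
wrong$form_missing  # TRUE
wrong$extra_names   # "formula"

# Correctly named argument: `form` is matched and nothing is left in `...`.
right <- toy_fit(form = y ~ x, data = d)
right$form_missing  # FALSE
right$extra_names   # NULL
```

Passing the formula as the first positional argument, train(Purchase ~ ., data = ...), avoids the naming issue altogether.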
You can also use the non-formula interface, in which you specify x and y separately:
caravan.boost <- train(x = data.Caravan.train[, -ncol(data.Caravan.train)],
                       y = data.Caravan.train$Purchase,
                       method = "xgbTree",
                       metric = "Accuracy",
                       trControl = control,
                       tuneGrid = grid)
Do note that these two ways of specifying the model do not always produce the same result when there are factor variables in x, since the formula interface calls model.matrix for most algorithms.
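The difference can be seen with a small base R sketch. The data frame toy and its columns are made up for illustration and are not part of the Caravan data. model.matrix expands a factor predictor into dummy (indicator) columns, whereas the x/y interface passes the factor column through unchanged:

```r
# Toy data (hypothetical, not the Caravan data) with one factor predictor.
toy <- data.frame(
  y      = factor(c("No", "Yes", "No", "Yes")),
  region = factor(c("a", "b", "c", "a")),
  age    = c(30, 45, 52, 38)
)

# Roughly what a formula interface hands to the model: a numeric design
# matrix in which the factor is expanded into dummy columns.
x_formula <- model.matrix(y ~ ., data = toy)[, -1]  # drop the intercept column

# What the x/y interface hands over: the predictor columns as they are.
x_direct <- toy[, c("region", "age")]

colnames(x_formula)  # "regionb" "regionc" "age"
colnames(x_direct)   # "region" "age"
```

A tree-based learner may therefore split on individual dummy columns in one case and see a single factor column in the other, which is one way the fitted models can diverge.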
To get the data:
library(ISLR)
data(Caravan)
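A side note on the tuning grid from the question: eta = c(.01, .05, .01) lists 0.01 twice, so expand.grid produces duplicated rows. The base R check below is a sketch for inspecting the grid before training; it shows that only 28 of the 42 rows are distinct, which is the number of parameter combinations listed in the results table above.

```r
# Rebuild the tuning grid exactly as given in the question.
grid <- expand.grid(max_depth = 1:7,
                    nrounds = 500,
                    eta = c(.01, .05, .01),
                    colsample_bytree = c(.5, .8),
                    gamma = 0,
                    min_child_weight = 1,
                    subsample = .6)

n_total  <- nrow(grid)          # 7 depths * 3 eta entries * 2 colsample values
n_unique <- nrow(unique(grid))  # duplicated eta value removed

n_total   # 42
n_unique  # 28
```

If the exercise intended three different learning rates, the third eta value would need to be changed; the source does not say which value was meant.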