Why do results using caret::train(..., method = "rpart") differ from rpart::rpart(...)?


Problem description

I'm taking part in the Coursera Practical Machine Learning course, and the coursework requires building predictive models using this dataset. After splitting the data into training and testing datasets based on the outcome of interest (labelled y here, though it is in fact the classe variable in the dataset):

inTrain <- createDataPartition(y = data$y, p = 0.75, list = F) 
training <- data[inTrain, ] 
testing <- data[-inTrain, ] 

I have tried two different methods:

modFit <- caret::train(y ~ ., method = "rpart", data = training)
pred <- predict(modFit, newdata = testing)
confusionMatrix(pred, testing$y)

versus

modFit <- rpart::rpart(y ~ ., data = training)
pred <- predict(modFit, newdata = testing, type = "class")
confusionMatrix(pred, testing$y)

I assumed they would give identical or very similar results, since the first approach loads the rpart package (suggesting to me that it uses that package for the method). However, the timings (caret is much slower) and the results are very different:

Method 1 (caret):

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1264  374  403  357  118
         B   25  324   28  146  124
         C  105  251  424  301  241
         D    0    0    0    0    0
         E    1    0    0    0  418

Method 2 (rpart):

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1288  176   14   79   25
         B   36  569   79   32   68
         C   31   88  690  121  113
         D   14   66   52  523   44
         E   26   50   20   49  651

As you can see, the second approach is the better classifier - the first method is very poor for classes D and E.

I realise this may not be the most appropriate place to ask this question, but I would really appreciate a better understanding of this and related issues. caret seems like a great package for unifying methods and call syntax, but I'm now hesitant to use it.

Answer

caret actually does quite a bit more under the hood. In particular, it resamples the data to tune the model hyperparameters. In your case it tries three values of cp (type modFit at the console and you'll see the accuracy for each value), whereas rpart simply uses cp = 0.01 unless you tell it otherwise (see ?rpart.control). The resampling also takes longer, especially since caret uses bootstrapping by default.
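
To see what caret actually explored, you can inspect the fitted train object (a minimal sketch, assuming the modFit object returned by the caret::train call above):

modFit             # prints the resampled accuracy for each cp value tried
modFit$results     # data frame of performance for every candidate cp
modFit$bestTune    # the cp value caret selected for the final model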

To get similar results, you need to disable the resampling and specify cp yourself:

modFit <- caret::train(y ~ ., method = "rpart", data = training,
                       trControl=trainControl(method="none"),
                       tuneGrid=data.frame(cp=0.01))
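
For comparison, the equivalent explicit rpart call would pin the same complexity parameter (a sketch only; cp = 0.01 simply writes out rpart's documented default, and modFit2 is a hypothetical name):

modFit2 <- rpart::rpart(y ~ ., data = training,
                        control = rpart::rpart.control(cp = 0.01))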

In addition, you should use the same random seed for both models.
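
For example (a minimal sketch; 123 is an arbitrary seed and the object names are purely illustrative), set the same seed immediately before each fit:

set.seed(123)
modFit_caret <- caret::train(y ~ ., method = "rpart", data = training,
                             trControl = trainControl(method = "none"),
                             tuneGrid = data.frame(cp = 0.01))

set.seed(123)
modFit_rpart <- rpart::rpart(y ~ ., data = training)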

That said, the extra functionality that caret provides is a Good Thing, and you should probably just go with caret. If you want to learn more, it's well-documented, and the author has a stellar book, Applied Predictive Modeling.

