Using ordinal variables in rpart and caret without converting to dummy categorical variables
Question
I am trying to create an ordinal regression tree in R using rpart, with the predictors mostly being ordinal data, stored as factor in R.
When I create the tree using rpart, I get something like this:
where the values are the factor levels (e.g. A170 has labels ranging from -5 to 10).
However, when I use caret's train function with rpart and then extract the final model, the tree no longer has ordinal predictors. See below for a sample output tree:
As you can see above, the ordinal variable A170 now seems to have been converted into multiple dummy variables, i.e. A17010 in the second tree is a dummy for A170 taking the value 10.
So, is it possible to retain ordinal variables instead of converting factor variables into multiple binary indicator variables when fitting trees with the caret package?
Recommended answer
Let's start with a reproducible example:
set.seed(144)
dat <- data.frame(x=factor(sample(1:6, 10000, replace=TRUE)))
dat$y <- ifelse(dat$x %in% 1:2, runif(10000) < 0.1, ifelse(dat$x %in% 3:4, runif(10000) < 0.4, runif(10000) < 0.7))*1
As you note, training with the rpart function directly keeps x as a single predictor and groups its factor levels together at each split:
library(rpart)
rpart(y~x, data=dat)
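Since the question is about ordinal predictors, it may help to compare how rpart handles an unordered and an ordered factor. The sketch below rebuilds data similar to the example (the names x_int, dat_fac and dat_ord are only for illustration) and inspects the ncat column of the fitted splits. As far as I understand the rpart documentation, this column holds the number of categories for a categorical split and +/-1 for a continuous-style split, so it is a quick way to see whether the ordering was used.

```r
library(rpart)

set.seed(144)
n <- 10000
x_int <- sample(1:6, n, replace = TRUE)

# Same outcome structure as the example above:
# levels 1-2, 3-4 and 5-6 have different success rates
p <- ifelse(x_int <= 2, 0.1, ifelse(x_int <= 4, 0.4, 0.7))
y <- as.numeric(runif(n) < p)

dat_fac <- data.frame(x = factor(x_int), y = y)                  # unordered factor
dat_ord <- data.frame(x = factor(x_int, ordered = TRUE), y = y)  # ordered factor

fit_fac <- rpart(y ~ x, data = dat_fac)
fit_ord <- rpart(y ~ x, data = dat_ord)

# ncat: number of categories for a categorical split,
# +/-1 for a split that respects the ordering
print(fit_fac$splits[, "ncat"])
print(fit_ord$splits[, "ncat"])
```

If the predictors really are ordinal, storing them as ordered factors lets rpart split on thresholds rather than on arbitrary subsets of levels.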
I was able to reproduce the caret package splitting the factor into its individual levels by using the formula interface to the train function:
library(caret)
train(y~x, data=dat, method="rpart")$finalModel
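The dummy names come from the way a formula is expanded into a design matrix: each non-baseline factor level becomes its own column, named by pasting the variable name and the level together. The sketch below uses base R's model.matrix on a small made-up data frame (the three A170 values are hypothetical, chosen to mirror the question) to show where a name like A17010 comes from. My understanding is that caret's formula method builds on the same expansion, but treat that as an assumption rather than a description of its internals.

```r
# Hypothetical factor with three of the levels mentioned in the question
toy <- data.frame(A170 = factor(c(-5, 0, 10)))

# Expand the factor the way a formula interface typically does
mm <- model.matrix(~ A170, data = toy)

# Column names are "<variable><level>"; the first level is the baseline
print(colnames(mm))
# "(Intercept)" "A1700" "A17010"
```

Once the tree is fit on these columns, it can only split on one indicator at a time, which is why the grouping of levels disappears.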
The solution I found to avoid splitting factors by level is to pass the raw data frame to the train function through its x and y arguments instead of using the formula interface:
train(x=data.frame(dat$x), y=dat$y, method="rpart")$finalModel
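One way to check that the factor survived is to look at which variables the final tree actually splits on. This is a minimal sketch, assuming rpart and caret are installed; it passes dat["x"] instead of data.frame(dat$x) only so that the column keeps the name x, and split_vars is a name I introduce here for illustration.

```r
library(rpart)
library(caret)

set.seed(144)
n <- 10000
dat <- data.frame(x = factor(sample(1:6, n, replace = TRUE)))
p <- ifelse(dat$x %in% 1:2, 0.1, ifelse(dat$x %in% 3:4, 0.4, 0.7))
dat$y <- as.numeric(runif(n) < p)

# x/y interface: the predictor data frame is passed through without a formula
# (caret may warn that a 0/1 numeric outcome is being treated as regression)
fit <- train(x = dat["x"], y = dat$y, method = "rpart")

# Names of the variables used in splits, leaving out the leaf marker
split_vars <- setdiff(unique(as.character(fit$finalModel$frame$var)), "<leaf>")
print(split_vars)
```

If the splits are reported on x itself rather than on columns such as x2 or x6, the factor was kept intact. The same check on a model fit through the formula interface should list the dummy columns instead.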