Caret and dummy variables

Question

When calling the train function of the caret package with a formula, the data is automatically transformed so that every factor variable is turned into a set of dummy variables.

How can I prevent this behaviour? Is it possible to tell caret "don't transform factors into dummy variables"?

For example, if I run the rpart algorithm on the etitanic data:

library(caret)
library(earth)   # provides the etitanic data set
data(etitanic)

# Recode the 0/1 outcome as 'YES'/'NO' labels
etitanic$survived[etitanic$survived==1] <- 'YES'
etitanic$survived[etitanic$survived!='YES'] <- 'NO'

model <- train(survived ~ ., data = etitanic, method = 'rpart')

Then the final model produced looks like this:

> model$finalModel
n= 1046 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 1046 427 NO (0.5917782 0.4082218)  
   2) sexmale>=0.5 658 135 NO (0.7948328 0.2051672)  
     4) age>=9.5 615 110 NO (0.8211382 0.1788618) *
     5) age< 9.5 43  18 YES (0.4186047 0.5813953)  
      10) sibsp>=2.5 16   1 NO (0.9375000 0.0625000) *
      11) sibsp< 2.5 27   3 YES (0.1111111 0.8888889) *
   3) sexmale< 0.5 388  96 YES (0.2474227 0.7525773) *

Whereas if I run the rpart algorithm directly and build a tree, I get:

> rpart(survived~., data=etitanic)
n= 1046 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 1046 427 NO (0.59177820 0.40822180)  
   2) sex=male 658 135 NO (0.79483283 0.20516717)  
     4) age>=9.5 615 110 NO (0.82113821 0.17886179) *
     5) age< 9.5 43  18 YES (0.41860465 0.58139535)  
      10) sibsp>=2.5 16   1 NO (0.93750000 0.06250000) *
      11) sibsp< 2.5 27   3 YES (0.11111111 0.88888889) *
   3) sex=female 388  96 YES (0.24742268 0.75257732)  
     6) pclass=3rd 152  72 NO (0.52631579 0.47368421)  
      12) age>=1.5 145  66 NO (0.54482759 0.45517241)  
        24) sibsp>=1.5 19   4 NO (0.78947368 0.21052632) *
        25) sibsp< 1.5 126  62 NO (0.50793651 0.49206349)  
          50) age>=27.5 44  15 NO (0.65909091 0.34090909) *
          51) age< 27.5 82  35 YES (0.42682927 0.57317073) *
      13) age< 1.5 7   1 YES (0.14285714 0.85714286) *
     7) pclass=1st,2nd 236  16 YES (0.06779661 0.93220339) *

Now, set aside the fact that the trees are different; I understand that they were built with different parameters. However, they were also built on different data sets. For example, the caret tree was built on a data set in which one column is "sexmale", a dummy column derived from the sex column in the original data.

Is there some way to tell caret not to create these dummy variables before feeding the data to rpart?

Recommended answer

To make caret behave exactly like rpart, I first set method = "none" in trainControl and pass a one-row tuneGrid with cp = 0.01. With these settings caret fits a single model without resampling, and cp matches rpart's default of 0.01.

ctrl <- trainControl(method = "none")
#caret formula model
model<-train(survived ~ ., 
             data=etitanic, 
             method='rpart', 
             trControl = ctrl, 
             tuneGrid = expand.grid(cp = 0.01))

# rpart model
model_rp <- rpart(survived~., data=etitanic)

print(model$finalModel)

 1) root 1046 427 NO (0.59177820 0.40822180)  
   2) sexmale>=0.5 658 135 NO (0.79483283 0.20516717)  
     4) age>=9.5 615 110 NO (0.82113821 0.17886179) *
     5) age< 9.5 43  18 YES (0.41860465 0.58139535)  
      10) sibsp>=2.5 16   1 NO (0.93750000 0.06250000) *
      11) sibsp< 2.5 27   3 YES (0.11111111 0.88888889) *
   3) sexmale< 0.5 388  96 YES (0.24742268 0.75257732)  
     6) pclass3rd>=0.5 152  72 NO (0.52631579 0.47368421)  
      12) age>=1.5 145  66 NO (0.54482759 0.45517241)  
        24) sibsp>=1.5 19   4 NO (0.78947368 0.21052632) *
        25) sibsp< 1.5 126  62 NO (0.50793651 0.49206349)  
          50) age>=27.5 44  15 NO (0.65909091 0.34090909) *
          51) age< 27.5 82  35 YES (0.42682927 0.57317073) *
      13) age< 1.5 7   1 YES (0.14285714 0.85714286) *
     7) pclass3rd< 0.5 236  16 YES (0.06779661 0.93220339) *

print(model_rp)


 1) root 1046 427 NO (0.59177820 0.40822180)  
   2) sex=male 658 135 NO (0.79483283 0.20516717)  
     4) age>=9.5 615 110 NO (0.82113821 0.17886179) *
     5) age< 9.5 43  18 YES (0.41860465 0.58139535)  
      10) sibsp>=2.5 16   1 NO (0.93750000 0.06250000) *
      11) sibsp< 2.5 27   3 YES (0.11111111 0.88888889) *
   3) sex=female 388  96 YES (0.24742268 0.75257732)  
     6) pclass=3rd 152  72 NO (0.52631579 0.47368421)  
      12) age>=1.5 145  66 NO (0.54482759 0.45517241)  
        24) sibsp>=1.5 19   4 NO (0.78947368 0.21052632) *
        25) sibsp< 1.5 126  62 NO (0.50793651 0.49206349)  
          50) age>=27.5 44  15 NO (0.65909091 0.34090909) *
          51) age< 27.5 82  35 YES (0.42682927 0.57317073) *
      13) age< 1.5 7   1 YES (0.14285714 0.85714286) *
     7) pclass=1st,2nd 236  16 YES (0.06779661 0.93220339) *

Looking at both models, you can see that even though caret converted the factor and character columns into dummy variables, using the default level as the reference class, the tree is exactly the same, with the same percentages in the nodes. You could use the partykit package and call as.party() on the models to get a better layout.
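To see where a column name such as "sexmale" comes from, here is a minimal base-R sketch. It assumes that the formula interface expands predictors in much the same way as model.matrix() with the default treatment contrasts, which is how caret's formula method appears to behave. The toy data frame is hypothetical and is not the etitanic data.

```r
# Hypothetical toy data, loosely shaped like the etitanic predictors
toy <- data.frame(
  sex    = factor(c("female", "male", "male", "female")),
  pclass = factor(c("1st", "2nd", "3rd", "3rd")),
  age    = c(29, 2, 30, 25)
)

# Expand the predictors with the default treatment contrasts:
# each factor becomes one 0/1 column per non-reference level
mm <- model.matrix(~ ., data = toy)

# Drop the intercept; the remaining columns are what the tree would see
dummy_cols <- setdiff(colnames(mm), "(Intercept)")
print(dummy_cols)
```

The first level of each factor ("female", "1st") serves as the reference class and gets no column of its own, which is why the caret tree splits on sexmale and pclass3rd rather than on sex and pclass.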

But if you want exactly the same model as rpart, with the factors left unconverted, you can use train's default (non-formula) interface and pass the predictors as x and the outcome as y.

# caret default (x/y) interface; etitanic[, -2] drops the survived column from the predictors
model_xy <-train(x = etitanic[, -2], 
                 y = etitanic$survived, 
                 method='rpart', 
                 trControl = ctrl, 
                 tuneGrid = expand.grid(cp = 0.01))

print(model_xy$finalModel)

 1) root 1046 427 NO (0.59177820 0.40822180)  
   2) sex=male 658 135 NO (0.79483283 0.20516717)  
     4) age>=9.5 615 110 NO (0.82113821 0.17886179) *
     5) age< 9.5 43  18 YES (0.41860465 0.58139535)  
      10) sibsp>=2.5 16   1 NO (0.93750000 0.06250000) *
      11) sibsp< 2.5 27   3 YES (0.11111111 0.88888889) *
   3) sex=female 388  96 YES (0.24742268 0.75257732)  
     6) pclass=3rd 152  72 NO (0.52631579 0.47368421)  
      12) age>=1.5 145  66 NO (0.54482759 0.45517241)  
        24) sibsp>=1.5 19   4 NO (0.78947368 0.21052632) *
        25) sibsp< 1.5 126  62 NO (0.50793651 0.49206349)  
          50) age>=27.5 44  15 NO (0.65909091 0.34090909) *
          51) age< 27.5 82  35 YES (0.42682927 0.57317073) *
      13) age< 1.5 7   1 YES (0.14285714 0.85714286) *
     7) pclass=1st,2nd 236  16 YES (0.06779661 0.93220339) *
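The formula-based tree and the x/y tree coincide here because the split that was chosen, pclass=3rd against pclass=1st,2nd, can also be written as pclass3rd>=0.5. That need not hold in general. As a rough sketch, assuming treatment-coded dummies with the first level as reference, the following counts how many distinct two-way splits a single node can make on a three-level factor under each encoding:

```r
pclass_levels <- c("1st", "2nd", "3rd")
k <- length(pclass_levels)

# Unordered factor: every non-trivial two-way grouping of the levels
# is a candidate split
n_factor_splits <- 2^(k - 1) - 1

# Treatment-coded dummies: one 0/1 column per non-reference level,
# so a single split can only separate one such level from the rest
n_dummy_splits <- k - 1

cat("factor splits:", n_factor_splits, "\n")
cat("dummy splits: ", n_dummy_splits, "\n")
```

With dummies, grouping 2nd and 3rd against 1st would take two successive splits instead of one, so the two interfaces can produce different trees on other data.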
