Different results with formula and non-formula for caret training


Question

I noticed that using formula and non-formula methods in caret while training produces different results. Also, the time taken for formula method is almost 10x the time taken for the non-formula method. Is this expected ?

> library(caret)       # provides train()
> library(data.table)  # provides data.table()
> z <- data.table(c1=sample(1:1000, 1000, replace=T), c2=as.factor(sample(LETTERS, 1000, replace=T)))

# SYSTEM TIME WITH FORMULA METHOD
# -------------------------------

> system.time(r <- train(c1 ~ ., z, method="rf", importance=T))
   user  system elapsed
376.233   9.241  18.190

> r
1000 samples
   1 predictors

No pre-processing
Resampling: Bootstrap (25 reps)

Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...

Resampling results across tuning parameters:

  mtry  RMSE  Rsquared  RMSE SD  Rsquared SD
  2     295   0.00114   4.94     0.00154
  13    300   0.00113   5.15     0.00151
  25    300   0.00111   5.16     0.00146

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 2.


# SYSTEM TIME WITH NON-FORMULA METHOD
# -----------------------------------

> system.time(r <- train(z[,2,with=F], z$c1, method="rf", importance=T))
   user  system elapsed
 34.984   2.977   2.708
Warning message:
In randomForest.default(trainX, trainY, mtry = tuneValue$.mtry,  :
  invalid mtry: reset to within valid range
> r
1000 samples
   1 predictors

No pre-processing
Resampling: Bootstrap (25 reps)

Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...

Resampling results

  RMSE  Rsquared  RMSE SD  Rsquared SD
  297   0.00152   6.67     0.00197

Tuning parameter 'mtry' was held constant at a value of 2

Answer

You have a categorical predictor with a moderate number of levels. When you use the formula interface, most modeling functions (including train, lm, glm, etc.) internally run model.matrix to process the data set. This creates dummy variables from any factor variables. The non-formula interface does not [1].
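
As a quick illustration, here is a minimal sketch reusing the z from above; the exact column count assumes all 26 letters were sampled into c2:

# What the formula interface does internally: model.matrix turns the
# factor c2 into one 0/1 dummy column per non-reference level.
m <- model.matrix(c1 ~ ., data = z)
dim(m)            # about 1000 x 26: an intercept plus 25 dummy columns
head(colnames(m)) # "(Intercept)" "c2B" "c2C" ...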

When you use dummy variables, only one factor level is used in any split. Tree methods handle categorical predictors differently; when dummy variables are not used, random forest sorts the factor levels based on the outcome and finds a 2-way split of the levels [2]. This takes more time.
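
To see the two representations side by side outside of caret, here is a hedged sketch that calls randomForest directly; the object names are illustrative, and it assumes the randomForest package is installed and z is defined as above:

library(randomForest)

# Non-formula path: the factor reaches randomForest intact, so each
# split searches over 2-way groupings of the 26 levels.
rf_factor <- randomForest(x = data.frame(c2 = z$c2), y = z$c1)

# Formula path: the factor is dummy-coded first, so each split can
# only use one level at a time.
x_dummy <- model.matrix(c1 ~ . - 1, data = z)  # one 0/1 column per level
rf_dummy <- randomForest(x = x_dummy, y = z$c1)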

Max

[1] I hate to be one of those people who says "in my book I show..." but in this case I will. Fig. 14.2 has a good illustration of this process for CART trees.

[2] God, I'm doing it again. The different representations of factors for trees are discussed in Section 14.1, and a comparison between the two approaches for one data set is shown in Section 14.7.

