rpart:分类回归量与连续回归量的计算时间 [英] rpart: Computational time for categorical vs continuous regressors

查看:46
本文介绍了rpart:分类回归量与连续回归量的计算时间的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在使用 rpart 包将回归树拟合到具有相对较少观察值和数千个采用两个可能值的分类预测变量的数据中.

i am currently using the rpart package to fit a regression tree to a data with relatively few observations and several thousand categorical predictors taking two possible values.

通过在较小的数据上测试包,我知道在这种情况下,我是否将回归量声明为分类变量(即因子)或保持原样(它们被编码为 +/-1)并不重要.

from testing the package out on smaller data i know that in this instance it doesn't matter whether i declare the regressors as categorical (i.e. factors) or leave them as they are (they are coded as +/-1).

但是,我仍然想了解为什么将我的解释变量作为因子传递会显着减慢算法的速度(尤其是因为我将很快获得新数据,其中响应采用 3 个不同的值并将它们视为连续值将不再是一种选择).肯定应该反过来吗?

however, i would still like to understand why passing my explanatory variables as factors significantly slows the algorithm down (not least because i shall soon get new data where response takes 3 diffirent values and treating them as continuous would no longer be an option). surely it should be the other way round?

这是模拟我的数据的示例代码:

here is a sample code emulating my data:

library(rpart)

x <- as.data.frame(matrix(sample(c(-1, +1), 50 * 3000, replace = T), nrow = 50))
y <- rnorm(50)

x.fac <- as.data.frame(lapply(x, factor))

现在比较:

system.time(rpart( y ~ ., data = x, method = 'anova'))

   user  system elapsed 
   1.62    0.21    1.85 

system.time(rpart( y ~ ., data = x.fac, method = 'anova'))

   user  system elapsed 
   246.87  165.91  412.92 

处理每个变量(因子)只有一个潜在的分裂可能性比处理整个范围的潜在分裂(对于连续变量)更简单、更快,所以我对 rpart 行为感到最困惑.任何澄清/建议都会非常感激.

dealing with only one potential split possibility per variable (factors) is simpler and faster than dealing with a whole range of potential splits (for continuous variables), so i am most confused by the rpart behaviour. any clarifications/suggestions would be very apprecaited.

推荐答案

您需要对代码进行概要分析才能确定,但​​如果时间差异不是来自 R,我会感到惊讶在准备模型矩阵时,将每个因子变量转换为两个二元变量.

You need to profile the code to be sure, but I would be surprised if the timing difference did not come from R having to turn each factor variable into two binary variables as it prepares the model matrix.

试试

Rprof("rpartProfile.Rprof")
rpart( y ~ ., data = x.fac, method = 'anova')
Rprof()

summaryRprof("rpartProfile.Rprof")

然后看看时间都花在了哪里.我现在已经完成了:

and look to see where the time is being spent. Which I have now done:

> summaryRprof("rpartProfile.Rprof")
$by.self
                          self.time self.pct total.time total.pct
"[[<-.data.frame"            786.46    72.45     786.56     72.46
"rpart.matrix"               294.26    27.11    1081.78     99.66
"model.frame.default"          1.04     0.10       3.00      0.28
"terms.formula"                0.96     0.09       0.96      0.09
"as.list.data.frame"           0.46     0.04       0.46      0.04
"makepredictcall.default"      0.46     0.04       0.46      0.04
"rpart"                        0.44     0.04    1085.38     99.99
"[[.data.frame"                0.16     0.01       0.42      0.04
"<Anonymous>"                  0.16     0.01       0.18      0.02
"match"                        0.14     0.01       0.22      0.02
"print"                        0.12     0.01       0.12      0.01
"model.matrix.default"         0.10     0.01       0.44      0.04
....

$by.total
                          total.time total.pct self.time self.pct
"rpart"                      1085.38     99.99      0.44     0.04
"rpart.matrix"               1081.78     99.66    294.26    27.11
"[[<-"                        786.62     72.47      0.06     0.01
"[[<-.data.frame"             786.56     72.46    786.46    72.45
"model.frame.default"           3.00      0.28      1.04     0.10
"eval"                          3.00      0.28      0.04     0.00
"eval.parent"                   3.00      0.28      0.00     0.00
"model.frame"                   3.00      0.28      0.00     0.00
"terms.formula"                 0.96      0.09      0.96     0.09
"terms"                         0.96      0.09      0.00     0.00
"makepredictcall"               0.50      0.05      0.04     0.00
"as.list.data.frame"            0.46      0.04      0.46     0.04
"makepredictcall.default"       0.46      0.04      0.46     0.04
"as.list"                       0.46      0.04      0.00     0.00
"vapply"                        0.46      0.04      0.00     0.00
"model.matrix.default"          0.44      0.04      0.10     0.01
"[["                            0.44      0.04      0.02     0.00
"model.matrix"                  0.44      0.04      0.00     0.00
....

$sample.interval
[1] 0.02

$sampling.time
[1] 1085.5

注意上面的大部分时间花在函数rpart.matrix上:

Note from the above that a big chunk of time is spent in function rpart.matrix:

> rpart:::rpart.matrix
function (frame) 
{
    if (!inherits(frame, "data.frame") || is.null(attr(frame, 
        "terms"))) 
        return(as.matrix(frame))
    for (i in 1:ncol(frame)) {
        if (is.character(frame[[i]])) 
            frame[[i]] <- as.numeric(factor(frame[[i]]))
        else if (!is.numeric(frame[[i]])) 
            frame[[i]] <- as.numeric(frame[[i]])
    }
    X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
    colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
    class(X) <- c("rpart.matrix", class(X))
    X
}

但大部分时间都花在了该函数中的 for 循环上,基本上是转换每一列并将它们添加回数据框.

But it is the for loop in that function where most of the time is spent, essentially converting each column and adding them back to the data frame.

这篇关于rpart:分类回归量与连续回归量的计算时间的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆