How to generate a prediction interval from a regression tree rpart object?
Question
How do you generate a prediction interval from a regression tree that is fit using rpart?
As I understand it, a regression tree predicts the response as the mean of the observations in the corresponding leaf node. I don't know how to get the variance for a leaf node from the model, but what I would like to do is simulate from the leaf node's mean and variance to obtain a prediction interval.
predict.rpart() doesn't give an option for an interval.
Example: I fit a tree to the iris data, but predict has no "interval" option:
> library(rpart)
> r1 <- rpart(Sepal.Length ~ ., cp = 0.001, data = iris[-nrow(iris), ])
> predict(r1, newdata = iris[nrow(iris), ], type = "interval")
Error in match.arg(type) :
  'arg' should be one of "vector", "prob", "class", "matrix"
Recommended answer
It is not clear to me what confidence intervals would mean for regression trees, since they are not classical statistical models like linear models. I see two main uses: characterizing the certainty of your tree, or characterizing the precision of the prediction for each leaf of the tree. An answer for each of these possibilities follows.
If you are looking for a confidence value for a split node, then party provides that directly: it uses permutation tests to determine statistically which variables are most important, and it attaches a p-value to each split. This is a significant advantage of party's ctree function over rpart, as explained here.
Second, if you are looking for a confidence interval for the value in each leaf, then the [0.025, 0.975] quantile interval of the observations in the leaf is most likely what you are looking for. The default plots in party take a similar approach, displaying a boxplot of the output values for each leaf:
library("party")
r2 <- ctree(Sepal.Length ~ .,data=iris)
plot(r2)
Retrieving the corresponding intervals can simply be done by:
# Leaf (terminal node) id of each observation
iris$leaf <- predict(r2, type="node")
# Quantiles of Sepal.Length within each leaf
CIleaf <- aggregate(iris$Sepal.Length,
                    by=list(leaf=iris$leaf),
                    quantile,
                    probs=c(0.025, 0.25, 0.75, 0.975))
And it is easy to visualize:
plot(as.factor(CIleaf$leaf), CIleaf[, 2],
     ylab="Sepal length", xlab="Regression tree leaf")
legend("bottomright",
       c(" 0.975 quantile", " 0.75 quantile", " mean",
         " 0.25 quantile", " 0.025 quantile"),
       pch=c("-", "_", "_", "_", "-"),
       pt.lwd=0.5, pt.cex=c(1, 1, 2, 1, 1), xjust=1)
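Since the question is about rpart rather than party, the same leaf-quantile idea can be sketched for an rpart fit. This is only a minimal sketch, not something from the original answer. It relies on two things: fit$where, which stores the row of fit$frame that each training observation's leaf occupies, and a workaround (an assumption here, not a documented rpart feature) that overwrites frame$yval with row indices so that predict() returns the leaf of a new observation instead of the leaf mean. The names train, new_obs, leaf_finder and in_leaf are made up for illustration. The last lines also try the simulation from the question, which additionally assumes the responses within a leaf are roughly normal.

```r
library(rpart)

# Hold out the last row, as in the question.
train   <- iris[-nrow(iris), ]
new_obs <- iris[nrow(iris), ]

fit <- rpart(Sepal.Length ~ ., data = train, cp = 0.001)

# For each training row, the row of fit$frame that holds its leaf.
train_leaf <- fit$where

# Workaround (assumption, not an official rpart option): replace the fitted
# values by the frame row index, so predict() returns the leaf index.
leaf_finder <- fit
leaf_finder$frame$yval <- seq_len(nrow(fit$frame))
new_leaf <- unname(predict(leaf_finder, newdata = new_obs))

# Training responses that fall in the same leaf as the new observation.
in_leaf <- train$Sepal.Length[train_leaf == new_leaf]

# Empirical 95% interval: quantiles of those responses.
interval <- quantile(in_leaf, probs = c(0.025, 0.975))

# Parametric alternative from the question: simulate from a normal
# distribution with the leaf mean and standard deviation.
set.seed(1)
sims <- rnorm(10000, mean = mean(in_leaf), sd = sd(in_leaf))
sim_interval <- quantile(sims, probs = c(0.025, 0.975))

print(interval)
print(sim_interval)
```

Both intervals describe only the spread of the training responses within one leaf; they ignore the uncertainty in the tree structure itself, so they are likely to be too narrow for a tree grown with a small cp.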