使用 R 中“rpart"包中的生存树来预测新的观察结果 [英] Using a survival tree from the 'rpart' package in R to predict new observations

查看:87
本文介绍了使用 R 中“rpart"包中的生存树来预测新的观察结果的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用 R 中的rpart"包来构建生存树,我希望使用这棵树然后对其他观察结果进行预测.

I'm attempting to use the "rpart" package in R to build a survival tree, and I'm hoping to use this tree to then make predictions for other observations.

我知道有很多关于 rpart 和预测的 SO 问题;但是,我还没有找到任何可以解决(我认为)特定于将 rpart 与Surv"对象一起使用的问题.

I know there have been a lot of SO questions involving rpart and prediction; however, I have not been able to find any that address a problem that (I think) is specific to using rpart with a "Surv" object.

我的特殊问题涉及解释预测"函数的结果.一个例子很有帮助:

My particular problem involves interpreting the results of the "predict" function. An example is helpful:

library(rpart)
library(OIsurv)

# Make Data:
set.seed(4)
dat = data.frame(X1 = sample(x = c(1,2,3,4,5), size = 1000, replace=T))
dat$t = rexp(1000, rate=dat$X1)
dat$t = dat$t / max(dat$t)
dat$e = rbinom(n = 1000, size = 1, prob = 1-dat$t )

# Survival Fit:
sfit = survfit(Surv(t, event = e) ~ 1, data=dat)
plot(sfit)

# Tree Fit:
tfit = rpart(formula = Surv(t, event = e) ~ X1 , data = dat, control=rpart.control(minsplit=30, cp=0.01))
plot(tfit); text(tfit)

# Survival Fit, Broken by Node in Tree:
dat$node = as.factor(tfit$where)
plot( survfit(Surv(dat$t, event = dat$e)~dat$node) )

到目前为止一切顺利.我对这里发生的事情的理解是 rpart 试图将指数生存曲线拟合到我的数据子集.基于这种理解,我相信当我调用 predict(tfit) 时,对于每个观察,我都会得到一个对应于该观察指数曲线参数的数字.因此,例如,如果 predict(fit)[1] 是 .46,那么这意味着对于我原始数据集中的第一个观察,曲线由方程 P(s) 给出= exp(−λt),其中 λ=.46.

So far so good. My understanding of what's going on here is that rpart is attempting to fit exponential survival curves to subsets of my data. Based on this understanding, I believe that when I call predict(tfit), I get, for each observation, a number corresponding to the parameter for the exponential curve for that observation. So, for example, if predict(fit)[1] is .46, then this means for the first observation in my original dataset, the curve is given by the equation P(s) = exp(−λt), where λ=.46.

这似乎正是我想要的.对于每个观察(或任何新观察),我可以获得在给定时间点该观察将存活/死亡的预测概率.(我意识到这可能是一种误解——这些曲线没有给出存活/死亡的概率,而是给出一个间隔的存活概率.不过,这并没有改变下面描述的问题.)

This seems like exactly what I'd want. For each observation (or any new observation), I can get the predicted probability that this observation will be alive/dead for a given time point. ( I'm realizing this is probably a misconception— these curves don't give the probability of alive/dead, but the probability of surviving an interval. This doesn't change the problem described below, though.)

但是,当我尝试使用指数公式时...

However, when I try and use the exponential formula...

# Predict:
# an attempt to use the rates extracted from the tree to
# capture the survival curve formula in each tree node.
rates = unique(predict(tfit))
for (rate in rates) {
  grid= seq(0,1,length.out = 100)
  lines(x= grid, y= exp(-rate*(grid)), col=2)
}

我在这里所做的是以与生存树相同的方式分割数据集,然后使用 survfit 为这些分区中的每一个绘制非参数曲线.那是黑线.我还绘制了与将速率"参数(我认为是)插入(我认为是)生存指数公式的结果相对应的线.

What I've done here is split the dataset in the same way the survival tree did, then used survfit to plot a non-parametric curve for each of these partitions. That's the black lines. I've also drawn lines corresponding to the result of plugging in (what I thought was) the 'rate' parameter into (what I thought was) the survival exponential formula.

我知道非参数拟合和参数拟合不一定相同,但这似乎不止于此:似乎我需要缩放 X 变量或其他东西.

I understand that the non-parametric and the parametric fit shouldn't necessarily be identical, but this seems more than that: it seems like I need to scale my X variable or something.

基本上,我似乎不明白 rpart/survival 在幕后使用的公式.谁能帮我从 (1) rpart 模型到 (2) 任意观察的生存方程?

Basically, I don't seem to understand the formula that rpart/survival is using under the hood. Can anyone help me get from (1) rpart model to (2) a survival equation for any arbitrary observation?

推荐答案

生存数据在内部以指数方式缩放,因此根节点中的预测速率始终固定为 1.000.predict() 方法报告的预测结果总是与根节点中的生存相关,即高或低某个因素.有关更多详细信息,请参阅 vignette("longintro", package = "rpart") 中的第 8.4 节.在任何情况下,您报告的 Kaplan-Meier 曲线都与 rpart 小插图中报告的完全一致.

The survival data are scaled internally exponentially so that the predicted rate in the root node is always fixed to 1.000. The predictions reported by the predict() method are then always relative to the survival in the root node, i.e., higher or lower by a certain factor. See Section 8.4 in vignette("longintro", package = "rpart") for more details. In any case, the Kaplan-Meier curves you are reported correspond exactly to what is also reported in the rpart vignette.

如果你想直接获得树中 Kaplan-Meier 曲线的图并得到预测的中位生存时间,你可以将 rpart 树强制为 constpartypartykit 包提供的树:

If you want to obtain directly the plots of the Kaplan-Meier curves in the tree and get predicted median survival times, you can coerce the rpart tree to a constparty tree as provided by the partykit package:

library("partykit")
(tfit2 <- as.party(tfit))
## Model formula:
## Surv(t, event = e) ~ X1
## 
## Fitted party:
## [1] root
## |   [2] X1 < 2.5
## |   |   [3] X1 < 1.5: 0.192 (n = 213)
## |   |   [4] X1 >= 1.5: 0.082 (n = 213)
## |   [5] X1 >= 2.5: 0.037 (n = 574)
## 
## Number of inner nodes:    2
## Number of terminal nodes: 3
##
plot(tfit2)

打印输出显示了中位生存时间和相应的 Kaplan-Meier 曲线的可视化.也可以使用 predict() 方法将 type 参数设置为 "response""prob" 来获得> 分别.

The print output shows the median survival time and the visualization the corresponding Kaplan-Meier curve. Both can also be obtained with the predict() method setting the type argument to "response" and "prob" respectively.

predict(tfit2, type = "response")[1]
##          5 
## 0.03671885 
predict(tfit2, type = "prob")[[1]]
## Call: survfit(formula = y ~ 1, weights = w, subset = w > 0)
## 
##  records    n.max  n.start   events   median  0.95LCL  0.95UCL 
## 574.0000 574.0000 574.0000 542.0000   0.0367   0.0323   0.0408 

作为 rpart 生存树的替代方案,您还可以考虑基于 ctree()(使用 logrank 分数)或完全基于条件推理的非参数生存树使用来自 partykit 包的通用 mob() 基础结构的参数生存树.

As an alternative to the rpart survival trees you might also consider the non-parametric survival trees based on conditional inference in ctree() (using logrank scores) or fully parametric survival trees using the general mob() infrastructure from the partykit package.

这篇关于使用 R 中“rpart"包中的生存树来预测新的观察结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆