partykit: Displaying terminal node percentile values above terminal node boxplots
Question
I'm trying to plot a regression tree generated with rpart using partykit. Suppose the formula used is y ~ x1 + x2 + x3 + ... + xn. What I would like to achieve is a tree with boxplots in the terminal nodes and, above each boxplot, a label listing the 10th, 50th, and 90th percentiles of the y values of the observations assigned to that node, for example "10th percentile = $200, mean = $247, 90th percentile = $292."
The code below generates the desired tree:
library("rpart")
fit <- rpart(Price ~ Mileage + Type + Country, cu.summary)
library("partykit")
tree.2 <- as.party(fit)
The following code generates the terminal plots, but without the desired labels on the terminal nodes:
plot(tree.2, type = "simple", terminal_panel = node_boxplot(tree.2,
col = "black", fill = "lightgray", width = 0.5, yscale = NULL,
ylines = 3, cex = 0.5, id = TRUE))
If I can display the mean y-value for a node, it should be easy enough to augment the label with percentiles, so my first step is to display just the mean y-value above each terminal node.
I know I can retrieve the mean y-value within a node (here node #12) with code such as this:
colMeans(tree.2[12]$fitted[2])
So I tried to write a function and pass it as the mainlab argument of the boxplot panel-generating function, to generate a label containing this mean:
labf <- function(node) colMeans(node$fitted[2])
plot(tree.2, type = "simple", terminal_panel = node_boxplot(tree.2,
col = "black", fill = "lightgray", width = 0.5, yscale = NULL,
ylines = 3, cex = 0.5, id = TRUE, mainlab = labf))
Unfortunately, this generates the error message:
Error in mainlab(names(obj)[nid], sum(wn)) : unused argument (sum(wn)).
But this seems to be on the right track, since if I use:
plot(tree.2, type = "simple", terminal_panel = node_boxplot(tree.2,
col = "black", fill = "lightgray", width = 0.5, yscale = NULL,
ylines = 3, cex = 0.5, id = TRUE, mainlab = colMeans(tree.2$fitted[2])))
then I get the correct mean y-value of the root node displayed. I would appreciate help with fixing the error described above so that I can show the mean y-value of each separate terminal node. From there, it should be easy to add the other percentiles and format things nicely.
Recommended answer
In principle, you are on the right track. However, when mainlab is a function, it is called with the arguments id and nobs, not with the node itself; see ?node_boxplot. That is why a one-argument function fails with "unused argument". You can also compute the table of means (or of some quantiles) for all terminal nodes more easily from the fitted data of the whole tree:
tab <- tapply(tree.2$fitted[["(response)"]],
factor(tree.2$fitted[["(fitted)"]], levels = 1:length(tree.2)),
FUN = mean)
Then you can prepare this for plotting by rounding/formatting:
tab <- format(round(tab, digits = 3))
tab
##           1           2           3           4           5           6
## "       NA" "       NA" "       NA" " 7629.048" "       NA" "12241.552"
##           7           8           9          10          11          12
## "14846.895" "22317.727" "       NA" "       NA" "17607.444" "21499.714"
##          13
## "27646.000"
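The NA entries in this table belong to inner nodes: the factor lists every node id as a level, but observations are only ever assigned to terminal nodes. The toy sketch below illustrates this with made-up values; response, node, and n_nodes are hypothetical stand-ins for tree.2$fitted[["(response)"]], tree.2$fitted[["(fitted)"]], and length(tree.2), not data from the question.

```r
# Hypothetical toy data standing in for the fitted data of a tree
response <- c(10, 12, 30, 34, 50)   # y value of each observation
node     <- c(2, 2, 4, 4, 5)        # terminal-node id of each observation
n_nodes  <- 5                       # total number of nodes, inner ones included

# Every node id is a level: inner nodes 1 and 3 hold no observations -> NA
tab_all <- tapply(response, factor(node, levels = 1:n_nodes), FUN = mean)
print(tab_all)

# Only ids that actually occur: one entry per terminal node, no NA
tab_term <- tapply(response, node, FUN = mean)
print(tab_term)
```

Either table can be indexed by node id as a character string, because tapply names its result after the factor levels.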
To add the formatted means to the display, write your own helper function for mainlab:
mlab <- function(id, nobs) paste("Mean =", tab[id])
plot(tree.2, tp_args = list(mainlab = mlab))
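The same pattern can be extended to the percentiles asked for in the question. The following is a minimal sketch under stated assumptions, not code from the answer: node_quantile_labels, qlab, and mlab_q are hypothetical names introduced here, and the plotting part runs only if rpart and partykit are installed. The title area above each boxplot is small, so the label is kept on a single line; long labels may still overlap between neighboring panels.

```r
# Hypothetical helper (base R only): one label per terminal node with
# the requested quantiles of the response.
node_quantile_labels <- function(response, node, probs = c(0.1, 0.5, 0.9)) {
  groups <- split(response, node)   # names are the node ids that occur
  vapply(groups, function(y) {
    q <- quantile(y, probs = probs, na.rm = TRUE, names = FALSE)
    paste(sprintf("%g%%: %s", 100 * probs,
                  formatC(q, format = "f", digits = 0, big.mark = ",")),
          collapse = ", ")
  }, character(1))
}

# Demonstration on the tree from the question, skipped if packages are missing
if (requireNamespace("rpart", quietly = TRUE) &&
    requireNamespace("partykit", quietly = TRUE)) {
  library("rpart")
  library("partykit")
  fit <- rpart(Price ~ Mileage + Type + Country, cu.summary)
  tree.2 <- as.party(fit)
  qlab <- node_quantile_labels(tree.2$fitted[["(response)"]],
                               tree.2$fitted[["(fitted)"]])
  # mainlab is called with two arguments (id, nobs); nobs is unused here
  mlab_q <- function(id, nobs) qlab[[as.character(id)]]
  plot(tree.2, tp_args = list(mainlab = mlab_q))
}
```

Looking the label up by as.character(id) keeps the lookup name-based, so it does not matter that inner nodes have no entry in qlab.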