How to count the observations falling in each node of a tree
Problem description
I am currently working with the wine data in the MMST package. I have split the whole dataset into training and test sets and built a tree with the following code:
library("rpart")
library("gbm")
library("randomForest")
library("MMST")
data(wine)
set.seed(1)  # arbitrary seed so the train/test split is reproducible
aux <- 1:178
train_indis <- sample(aux, 142, replace = FALSE)
test_indis <- setdiff(aux, train_indis)
train <- wine[train_indis, ]
test <- wine[test_indis, ]  # split the dataset into training and test sets
model.control <- rpart.control(minsplit = 5, xval = 10, cp = 0)
fit_wine <- rpart(class ~ MalicAcid + Ash + AlcAsh + Mg + Phenols + Proa + Color + Hue + OD + Proline, data = train, method = "class", control = model.control)
dev.new()  # opens a graphics device on any platform; windows() exists only on Windows
plot(fit_wine, branch = 0.5, uniform = TRUE, compress = TRUE, main = "Full Tree: without pruning")
text(fit_wine, use.n = TRUE, all = TRUE, cex = 0.6)
This gives me a plot like this:
What do the numbers under each node (for example 0/1/48 under Grignolino) mean? If I want to know how many training and test samples fall into each node, what should I write in the code?
Recommended answer
The numbers give the number of training cases of each class in that node. So the label "0 / 1 / 48" tells us that the node contains 0 cases of category 1 (Barbera, I infer), only one case of category 2 (Barolo), and 48 cases of category 3 (Grignolino).
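The same per-node class counts can be read straight from the fitted object instead of the plot. The following is a minimal sketch rather than a definitive recipe: it assumes the built-in iris data as a stand-in for the wine data (so it runs without MMST), and it relies on the layout of frame$yval2 for classification trees, where the first column is the predicted class and the next columns are the class counts.

```r
library(rpart)

# Assumption: iris stands in for the wine data; Species plays the role of class.
fit <- rpart(Species ~ ., data = iris, method = "class")

lev <- attr(fit, "ylevels")   # class labels, in the order used by the node labels
k <- length(lev)

# Column 1 of yval2 is the predicted class; columns 2..(k + 1) are the class counts.
counts <- fit$frame$yval2[, 1 + seq_len(k), drop = FALSE]
dimnames(counts) <- list(node = rownames(fit$frame), class = lev)

counts            # one row per node, one column per class
rowSums(counts)   # total training cases per node, the same as fit$frame$n
```

Each row corresponds to one "a/b/c" label drawn by text(fit, use.n = TRUE), with the columns in the order of the factor levels.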
You can get detailed information about the tree and each node using summary(fit_wine). See ?summary.rpart for more details.
You can additionally use predict() (which will call predict.rpart()) to see how the tree classifies a dataset. For example, predict(fit_wine, train, type = "class"). Or wrap it in a table for easy viewing: table(predict(fit_wine, train, type = "class"), train[, "class"])
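The same table works for held-out data, which is usually the more interesting check. Here is a small sketch of that idea, again assuming iris in place of the wine data; the seed and the 30-row test size are arbitrary choices for illustration.

```r
library(rpart)

set.seed(1)  # arbitrary seed, only so the split is repeatable
# Assumption: iris stands in for the wine data.
test_idx <- sample(nrow(iris), 30)
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

fit <- rpart(Species ~ ., data = train, method = "class")

# Predicted class for each test row, cross-tabulated against the true class.
pred <- predict(fit, test, type = "class")
confusion <- table(predicted = pred, actual = test$Species)
confusion

# Share of test cases on the diagonal, i.e. classified correctly.
sum(diag(confusion)) / sum(confusion)
```

Naming the two arguments of table() labels the rows and columns, so it is clear which axis holds the predictions.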
If you specifically want to know which leaf node an observation falls into, this information is stored in fit_wine$where. For each case in the training data, fit_wine$where contains the row number of fit_wine$frame that represents the leaf node where the case falls. So we can get the leaf information for each case with:
trainingnodes <- rownames(fit_wine$frame)[fit_wine$where]
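Once each training case has a leaf label, counting the cases per leaf is a single table() call. A minimal sketch, assuming iris as a stand-in for the wine data:

```r
library(rpart)

# Assumption: iris stands in for the wine data.
fit <- rpart(Species ~ ., data = iris, method = "class")

# Node number of the leaf that each training row ends up in.
leaf <- rownames(fit$frame)[fit$where]

table(leaf)                 # number of training observations per leaf
table(leaf, iris$Species)   # the same counts broken down by true class
```

The second table should match the per-class counts printed under the leaves by text(fit, use.n = TRUE).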
To get the leaf information for the test data, I used to run predict() with type = "matrix" and infer it from the result. This returns, confusingly, a matrix produced by concatenating the predicted class, the class counts at that node in the fitted tree, and the class probabilities. So for this example:
testresults <- predict(fit_wine, test, type = "matrix")
testresults <- data.frame(testresults)
# Name the first seven columns; some rpart versions append extra columns after these.
names(testresults)[1:7] <- c("ClassGuess", "NofClass1onNode", "NofClass2onNode",
                             "NofClass3onNode", "PClass1", "PClass2", "PClass3")
From this, we can infer the different nodes, e.g., from unique(testresults[, 2:4]), but it is inelegant.
However, Yuji has a clever hack for this in a previous question. He copies the rpart object and substitutes the node numbers for the classes, so running predict() returns the node rather than the class:
nodes_wine <- fit_wine
nodes_wine$frame$yval = as.numeric(rownames(nodes_wine$frame))
testnodes <- predict(nodes_wine, test, type="vector")
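Building on that trick, the test counts can be rolled up to every node, not just the leaves. The sketch below is one way to do it under stated assumptions: iris replaces the wine data, the helper ancestors() is a name introduced here for illustration, and it relies on rpart numbering the children of node k as 2k and 2k + 1.

```r
library(rpart)

set.seed(1)  # arbitrary seed, only so the split is repeatable
# Assumption: iris stands in for the wine data.
test_idx <- sample(nrow(iris), 30)
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]
fit <- rpart(Species ~ ., data = train, method = "class")

# Node-number trick: make predict() return the leaf's node number.
nodes_fit <- fit
nodes_fit$frame$yval <- as.numeric(rownames(nodes_fit$frame))
test_leaf <- predict(nodes_fit, test, type = "vector")

# Hypothetical helper: a node together with all nodes on its path to the root.
# Children of node k are numbered 2k and 2k + 1, so the parent is k %/% 2.
ancestors <- function(k) {
  out <- k
  while (k > 1) {
    k <- k %/% 2
    out <- c(out, k)
  }
  out
}

# A test case passes through a node if that node is on the path to its leaf.
all_nodes <- as.numeric(rownames(fit$frame))
n_test <- sapply(all_nodes, function(node) {
  sum(sapply(test_leaf, function(k) node %in% ancestors(k)))
})

data.frame(node = all_nodes, n_train = fit$frame$n, n_test = n_test)
```

The n_train column comes directly from fit$frame$n, so the resulting data frame lists training and test counts side by side for each node. If only the leaves matter, table(test_leaf) is enough.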
I've included the solution here, but people should go upvote him.