Understanding tree structure in R gbm package
Question
I am having some difficulty understanding how the trees are structured in R's gbm (gradient boosted machine) package. Specifically, looking at the output of pretty.gbm.tree, which features do the indices in SplitVar point to?
I trained a GBM on a dataset. Here is roughly the top quarter of one of my trees, the result of a call to pretty.gbm.tree:
```
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight   Prediction
0        9  6.250000e+01        1         2          21      0.6634681   5981  0.005000061
1       -1  1.895699e-12       -1        -1          -1      0.0000000   3013  0.018956988
2       31  4.462500e+02        3         4          20      1.0083722   2968 -0.009168477
3       -1  1.388483e-22       -1        -1          -1      0.0000000   1430  0.013884830
4       38  5.500000e+00        5        18          19      1.5748155   1538 -0.030602956
5       24  7.530000e+03        6        13          17      2.8329899    361 -0.078738904
6       41  2.750000e+01        7        11          12      2.2499063    334 -0.064752766
7       28 -3.155000e+02        8         9          10      1.5516610     57 -0.243675567
8       -1 -3.379312e-11       -1        -1          -1      0.0000000     45 -0.337931219
9       -1  1.922333e-10       -1        -1          -1      0.0000000     12  0.109783128
```
It looks to me as though the node indices are 0-based, judging by how LeftNode, RightNode, and MissingNode point to different rows. But when I test this by following data samples down the tree to their predictions, I get the correct answer when I treat SplitVar as using 1-based indexing.
However, one of the many trees I built has a zero in the SplitVar column! Here is that tree:
```
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
0        4  1.462500e+02        1         2          21        0.41887   5981  0.0021651262
1       -1  4.117688e-22       -1        -1          -1        0.00000    512  0.0411768781
2        4  1.472500e+02        3         4          20        1.05222   5469 -0.0014870985
3       -1 -2.062798e-11       -1        -1          -1        0.00000     23 -0.2062797579
4        0  4.750000e+00        5         6          19        0.65424   5446 -0.0006222011
5       -1  3.564879e-23       -1        -1          -1        0.00000   4897  0.0035648788
6       28 -3.195000e+02        7        11          18        1.39452    549 -0.0379703437
```
What is the correct way to interpret the indexing used by gbm trees?
Recommended answer
The first column printed by pretty.gbm.tree is the row.names assigned in the script pretty.gbm.tree.R. There, the row names are set with row.names(temp) <- 0:(nrow(temp)-1), where temp is the tree information stored as a data.frame. The right way to interpret the row.names is to read them as the node_id, with the root node assigned the value 0.
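As a small illustration of the two 0-based conventions (node ids in the row names, variable indices in SplitVar), the sketch below rebuilds the first three rows of the first tree from the question by hand and maps SplitVar to column names. The names in var_names are hypothetical placeholders; with a fitted model one would presumably take the data frame from pretty.gbm.tree(fit, i.tree = 1) and the names from fit$var.names instead.

```r
# First three rows of the first tree in the question, typed in by hand
# (only the columns needed for this illustration).
tree <- data.frame(
  SplitVar      = c(9L, -1L, 31L),
  SplitCodePred = c(62.5, 1.895699e-12, 446.25),
  LeftNode      = c(1L, -1L, 3L),
  RightNode     = c(2L, -1L, 4L),
  MissingNode   = c(21L, -1L, 20L)
)
row.names(tree) <- 0:(nrow(tree) - 1)   # node ids start at 0, as in pretty.gbm.tree.R

# Hypothetical feature names; a real model would supply its own.
var_names <- paste0("x", 1:42)

# SplitVar is 0-based, R vectors are 1-based, so add 1 when looking up names.
# Leaves carry SplitVar == -1 and get no name.
is_split <- tree$SplitVar >= 0
tree$SplitVarName <- NA_character_
tree$SplitVarName[is_split] <- var_names[tree$SplitVar[is_split] + 1]

print(tree$SplitVarName)   # "x10" NA "x32"
```

A SplitVar of 0, as in the second tree from the question, then simply refers to the first column.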
In your example:
```
Id SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight  Prediction
0         9  6.250000e+01        1         2          21      0.6634681   5981 0.005000061
```
This row means that the root node (row number 0) is split on split variable 9. Split variables are numbered from 0, so this is the 10th column of the training set x. A SplitCodePred of 62.5 (printed as 6.250000e+01) means that all points with a value less than 62.5 went to LeftNode 1 and all points with a value of 62.5 or more went to RightNode 2. All points with a missing value in this column were assigned to MissingNode 21. The split gave an ErrorReduction of 0.6634, and the root node held 5981 observations (Weight). The Prediction of 0.005 is the value assigned to every point at this node before the split. For terminal nodes (leaves), marked by -1 in SplitVar, LeftNode, RightNode, and MissingNode, the Prediction is the value predicted for all points in that leaf, multiplied by the shrinkage.
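The routing rules described here can be sketched as a short function that follows one observation down a tree laid out like pretty.gbm.tree output. This is a minimal sketch, not gbm's own code: toy_tree and find_leaf are hypothetical names, the tree values are made up, and only continuous splits are handled (categorical splits are encoded differently and are not covered).

```r
# A made-up tree in the pretty.gbm.tree column layout.
# Node 0 splits on variable 0 at 2.5; node 2 splits on variable 1 at 10.
toy_tree <- data.frame(
  SplitVar      = c( 0L,  -1L,  1L,   -1L,  -1L,  -1L,  -1L),
  SplitCodePred = c(2.5, 0.10,  10, -0.20, 0.30, 0.05, 0.01),
  LeftNode      = c( 1L,  -1L,  3L,   -1L,  -1L,  -1L,  -1L),
  RightNode     = c( 2L,  -1L,  4L,   -1L,  -1L,  -1L,  -1L),
  MissingNode   = c( 6L,  -1L,  5L,   -1L,  -1L,  -1L,  -1L)
)
row.names(toy_tree) <- 0:(nrow(toy_tree) - 1)

# Return the id of the leaf that observation x (a numeric vector) lands in.
find_leaf <- function(tree, x) {
  node <- 0L                              # start at the root, node id 0
  repeat {
    row <- tree[node + 1L, ]              # node ids are 0-based, R rows 1-based
    if (row$SplitVar == -1L) return(node) # -1 marks a leaf
    value <- x[[row$SplitVar + 1L]]       # SplitVar is 0-based as well
    if (is.na(value)) {
      node <- row$MissingNode
    } else if (value < row$SplitCodePred) {
      node <- row$LeftNode
    } else {
      node <- row$RightNode
    }
  }
}

find_leaf(toy_tree, c(1, 0))    # 1: first variable is below 2.5
find_leaf(toy_tree, c(3, 12))   # 4: right at the root, right again at node 2
find_leaf(toy_tree, c(NA, 5))   # 6: missing value at the root
```

Checking a few training rows this way against the model's own predictions is one way to confirm which indexing convention applies.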
To understand the tree structure, it is important to note that nodes are numbered in depth-first order. When the root node (node id 0) is split into its left and right nodes, the left side is processed until no further splits are possible before the right node is labeled. In both trees in your example, RightNode gets the value 2 because in both cases the LeftNode turns out to be a leaf node.
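The depth-first numbering can be checked with a small recursive walk. The sketch below assumes children are visited in the order left, right, missing, which is consistent with the node ids in the trees from the question (for example 1, 2, 21 at the root of the first tree). toy_tree and dfs_order are hypothetical names and the tree is made up.

```r
# A made-up tree in the pretty.gbm.tree column layout (structure columns only).
toy_tree <- data.frame(
  SplitVar    = c(0L, -1L, 1L, -1L, -1L, -1L, -1L),
  LeftNode    = c(1L, -1L, 3L, -1L, -1L, -1L, -1L),
  RightNode   = c(2L, -1L, 4L, -1L, -1L, -1L, -1L),
  MissingNode = c(6L, -1L, 5L, -1L, -1L, -1L, -1L)
)
row.names(toy_tree) <- 0:(nrow(toy_tree) - 1)

# Visit a node, then its left, right, and missing subtrees, collecting node ids.
dfs_order <- function(tree, node = 0L) {
  row <- tree[node + 1L, ]                # node ids are 0-based
  if (row$SplitVar == -1L) return(node)   # leaf: nothing below it
  c(node,
    dfs_order(tree, row$LeftNode),
    dfs_order(tree, row$RightNode),
    dfs_order(tree, row$MissingNode))
}

dfs_order(toy_tree)   # 0 1 2 3 4 5 6
```

If the visiting order matches 0, 1, 2, ... the ids were assigned depth-first. Here the left child of the root is a leaf, so the right child gets id 2, just as in both trees from the question.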