Understanding tree structure in R gbm package


Question

I am having some difficulty understanding how the trees are structured in R's gbm (gradient boosted machine) package. Specifically, in the output of pretty.gbm.tree, which features do the indices in SplitVar point to?

I trained a GBM on a dataset. Here is roughly the top quarter of one of my trees, as returned by a call to pretty.gbm.tree:

```
   SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight   Prediction
0         9  6.250000e+01        1         2          21      0.6634681   5981  0.005000061
1        -1  1.895699e-12       -1        -1          -1      0.0000000   3013  0.018956988
2        31  4.462500e+02        3         4          20      1.0083722   2968 -0.009168477
3        -1  1.388483e-22       -1        -1          -1      0.0000000   1430  0.013884830
4        38  5.500000e+00        5        18          19      1.5748155   1538 -0.030602956
5        24  7.530000e+03        6        13          17      2.8329899    361 -0.078738904
6        41  2.750000e+01        7        11          12      2.2499063    334 -0.064752766
7        28 -3.155000e+02        8         9          10      1.5516610     57 -0.243675567
8        -1 -3.379312e-11       -1        -1          -1      0.0000000     45 -0.337931219
9        -1  1.922333e-10       -1        -1          -1      0.0000000     12  0.109783128
```

From the way LeftNode, RightNode, and MissingNode point to different rows, it looks to me as though the node indices are 0-based. However, when I test this by following data samples down the tree to their prediction, I get the correct answer only when I treat SplitVar as 1-based.

Yet one of the many trees I built has a zero in the SplitVar column! Here is that tree:

```
   SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
0         4  1.462500e+02        1         2          21        0.41887   5981  0.0021651262
1        -1  4.117688e-22       -1        -1          -1        0.00000    512  0.0411768781
2         4  1.472500e+02        3         4          20        1.05222   5469 -0.0014870985
3        -1 -2.062798e-11       -1        -1          -1        0.00000     23 -0.2062797579
4         0  4.750000e+00        5         6          19        0.65424   5446 -0.0006222011
5        -1  3.564879e-23       -1        -1          -1        0.00000   4897  0.0035648788
6        28 -3.195000e+02        7        11          18        1.39452    549 -0.0379703437
```

What is the correct way to view the indexing used by gbm's trees?

Solution

The first column printed by pretty.gbm.tree is the row.names assigned in the script pretty.gbm.tree.R. There, the row names are set with row.names(temp) <- 0:(nrow(temp)-1), where temp is the tree information stored as a data.frame. The right way to interpret the row names is as node ids, with the root node assigned the value 0.

In your example:

```
Id SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight   Prediction
0         9  6.250000e+01        1         2          21      0.6634681   5981  0.005000061
```

means that the root node (row name 0) is split on split variable 9. The numbering of split variables starts from 0, so this is the 10th column of the training set x. Under this reading, the 0 in the SplitVar column of your second tree refers to the first column. The SplitCodePred of 62.5 (6.250000e+01) is the split point: all points with a value less than 62.5 went to LeftNode 1, and all points with a value greater than 62.5 went to RightNode 2. All points with a missing value in this column were assigned to MissingNode 21. The split reduced the error by 0.6634 (ErrorReduction), and the root node held 5981 observations (Weight). The Prediction of 0.005 is the value assigned to all points at this node before it was split. For terminal nodes (leaves), marked by -1 in SplitVar, LeftNode, RightNode, and MissingNode, Prediction is the value predicted for all points belonging to that leaf, multiplied by the shrinkage.
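To make the two index conventions concrete, here is a minimal sketch in base R that follows one observation down a tree stored in the layout above. The tree is hypothetical: only the root row's SplitVar and split point are taken from the example, and the three leaves are invented. The helper name walk_tree is also an assumption, and the sketch assumes every split is on a continuous variable, with values below the split point going left.

```r
# Hypothetical four-node tree in the pretty.gbm.tree layout.
# Row names are 0-based node ids; SplitVar is a 0-based column index.
tree <- data.frame(
  SplitVar      = c(9, -1, -1, -1),
  SplitCodePred = c(62.5, 0.02, -0.01, 0.005),
  LeftNode      = c(1, -1, -1, -1),
  RightNode     = c(2, -1, -1, -1),
  MissingNode   = c(3, -1, -1, -1)
)
row.names(tree) <- 0:(nrow(tree) - 1)

# Return the id of the leaf that observation x (a numeric vector of
# predictors, in training-column order) ends up in.
walk_tree <- function(tree, x) {
  node <- 0
  repeat {
    cur <- tree[node + 1, ]        # node id k is stored in R row k + 1
    if (cur$SplitVar == -1) {
      return(node)                 # -1 marks a terminal node
    }
    xv <- x[cur$SplitVar + 1]      # SplitVar k is the (k + 1)-th predictor
    node <- if (is.na(xv)) {
      cur$MissingNode
    } else if (xv < cur$SplitCodePred) {
      cur$LeftNode
    } else {
      cur$RightNode
    }
  }
}

x <- rep(0, 12)
x[10] <- 70                        # 10th column, i.e. SplitVar 9
leaf <- walk_tree(tree, x)         # 70 is not below 62.5, so this is node 2
```

Both `+ 1` offsets are there because R vectors and data frame rows are 1-based while the table's ids are 0-based. For a fitted model, the predictor names should be available in the gbm object's var.names component, so something like fit$var.names[SplitVar + 1] (with fit standing for your own model object) would give the name of the split variable.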

To understand the tree structure, it's important to note that the tree is split in a depth-first fashion. When the root node (node id 0) is split into its left and right nodes, the left side is processed until no further splits are possible before the algorithm returns and labels the right node. In both trees in your example, the root's RightNode gets the value 2. This is because in both cases the LeftNode turns out to be a leaf node.
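The depth-first numbering can be checked with a short base R sketch. It walks a tree table by visiting each node, then its left, right, and missing subtrees, and returns the ids in visiting order. The tree below is hypothetical, the function name preorder is an assumption, and the left, right, missing order is inferred from the node ids in the question's first tree rather than taken from the gbm source.

```r
# Hypothetical tree: the root's left child is a leaf, its right child
# splits again, and each split also has a leaf for missing values.
tree <- data.frame(
  SplitVar    = c(4, -1,  0, -1, -1, -1, -1),
  LeftNode    = c(1, -1,  3, -1, -1, -1, -1),
  RightNode   = c(2, -1,  4, -1, -1, -1, -1),
  MissingNode = c(6, -1,  5, -1, -1, -1, -1)
)
row.names(tree) <- 0:(nrow(tree) - 1)

# Visit a node, then its left, right, and missing subtrees, in that order.
preorder <- function(tree, node = 0) {
  cur <- tree[node + 1, ]          # node id k is stored in R row k + 1
  if (cur$SplitVar == -1) {
    return(node)                   # leaf: nothing below it to visit
  }
  c(node,
    preorder(tree, cur$LeftNode),
    preorder(tree, cur$RightNode),
    preorder(tree, cur$MissingNode))
}

visit_order <- preorder(tree)
# If ids are handed out depth-first, the visiting order is 0, 1, 2, ...
numbered_depth_first <- all(visit_order == 0:(nrow(tree) - 1))
```

Here the root's left child is a leaf, so the right child receives id 2, matching the pattern described above. The root's MissingNode gets the last id because everything under the left and right children is numbered first.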
