了解R gbm包中的树结构 [英] Understanding tree structure in R gbm package

查看:159
本文介绍了了解R gbm包中的树结构的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我很难理解R的gbm梯度增强机器包中树木的结构.具体来说,请查看pretty.gbm.tree SplitVar中的哪些索引指向的功能?

I am having some difficulty understanding how the trees are structured in R's gbm gradient boosted machine package. Specifically, looking at the output of the pretty.gbm.tree Which features do the indices in SplitVar point to?

我在数据集上训练了一个GBM,这是我的一棵树的顶部〜四分之一-调用pretty.gbm.tree的结果:

I trained a GBM on a dataset, here is the top ~quarter of one of my trees -- the result of a call to pretty.gbm.tree:

   SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight   Prediction
0         9  6.250000e+01        1         2          21      0.6634681   5981  0.005000061
1        -1  1.895699e-12       -1        -1          -1      0.0000000   3013  0.018956988
2        31  4.462500e+02        3         4          20      1.0083722   2968 -0.009168477
3        -1  1.388483e-22       -1        -1          -1      0.0000000   1430  0.013884830
4        38  5.500000e+00        5        18          19      1.5748155   1538 -0.030602956
5        24  7.530000e+03        6        13          17      2.8329899    361 -0.078738904
6        41  2.750000e+01        7        11          12      2.2499063    334 -0.064752766
7        28 -3.155000e+02        8         9          10      1.5516610     57 -0.243675567
8        -1 -3.379312e-11       -1        -1          -1      0.0000000     45 -0.337931219
9        -1  1.922333e-10       -1        -1          -1      0.0000000     12  0.109783128
```

在我看来,从LeftNode, RightNodeMissingNode指向不同行的角度来看,索引是基于0的.当通过使用数据样本并按照其预测树进行测试时,当我认为SplitVar使用基于 1的索引时,我会得到正确的答案.

It looks to me here that the indices are 0 based, from looking at how LeftNode, RightNode, and MissingNode point to different rows. When testing this out by using data samples and following it down the tree to their prediction, I get the correct answer when I consider SplitVar to be using 1 based indexing.

但是,我构建的许多树中有1个在SplitVar列中有一个 zero !这是这棵树:

However, 1 of the many trees I build has a zero in the SplitVar column! Here is this tree:

SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
0         4  1.462500e+02        1         2          21      0.41887   5981  0.0021651262
1        -1  4.117688e-22       -1        -1          -1      0.00000    512  0.0411768781
2         4  1.472500e+02        3         4          20      1.05222   5469 -0.0014870985
3        -1 -2.062798e-11       -1        -1          -1      0.00000     23 -0.2062797579
4         0  4.750000e+00        5         6          19      0.65424   5446 -0.0006222011
5        -1  3.564879e-23       -1        -1          -1      0.00000   4897  0.0035648788
6        28 -3.195000e+02        7        11          18      1.39452    549 -0.0379703437

查看gbm树使用的索引的正确方法是什么?

推荐答案

使用pretty.gbm.tree时打印的第一列是在脚本pretty.gbm.tree.R中分配的row.names.在脚本中,row.names被分配为row.names(temp) <- 0:(nrow(temp)-1),其中temp是以data.frame形式存储的树信息.解释row.names的正确方法是将根节点分配为0值,将其读取为node_id.

The first column that is printed when you use the pretty.gbm.tree is the row.names that is assigned in the script pretty.gbm.tree.R. In the script, the row.names is assigned as row.names(temp) <- 0:(nrow(temp)-1) where temp is the tree information stored in data.frame form. The right way to interpret the row.names is to read it as the node_id with the root node being assigned a 0 value.

在您的示例中:

Id SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction 0 9 6.250000e+01 1 2 21 0.6634681 5981 0.005000061

Id SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction 0 9 6.250000e+01 1 2 21 0.6634681 5981 0.005000061

表示根节点(由行号0表示)被第9个拆分变量拆分(此处拆分变量的编号从0开始,因此拆分变量是训练集中的第10列x). 6.25中的SplitCodePred表示所有小于6.25的点都进入了LeftNode 1,所有大于6.25的点都进入了RightNode 2.在此列中所有具有缺失值的点都分配给MissingNode 21.由于此拆分,ErrorReduction0.6634,并且根节点中有5981(Weight). 0.005中的Prediction表示在分割点之前已分配给该节点所有值的值.在SplitVarLeftNodeRightNodeMissingNode中用-1表示的终端节点(或叶子)的情况下,Prediction表示针对属于该叶子节点的所有点的预测值调整(倍)shrinkage的倍数.

means that the root node (indicated by the row number 0) is split by the 9-th split variable (the numbering of the split variable here starts from 0, so the split variable is the 10th column in the training set x). SplitCodePred of 6.25 denotes that all points less than 6.25 went to the LeftNode 1 and all points greater than 6.25 went to RightNode 2. All points that had a missing value in this column were assigned to the MissingNode 21. The ErrorReduction was 0.6634 due to this split and there were 5981 (Weight) in the root node. Prediction of 0.005 denotes the value assigned to all values at this node before the point was split. In the case of terminal nodes (or leaves) denoted by -1 in SplitVar, LeftNode, RightNode, and MissingNode, the Prediction denotes the value predicted for all the points belonging to this leaf node adjusted (times) times the shrinkage.

要了解树的结构,请务必注意,树的拆分以深度优先的方式进行.因此,当根节点(节点ID为0)拆分为左节点和右节点时,将处理左侧,直到无法再进行拆分为止,然后再返回并标记右节点.在示例中的两棵树中,RightNode的值为2.这是因为在两种情况下,LeftNode都是叶子节点.

To understand the tree structure, its important to note that the splitting of the tree happens in a depth first fashion. So when the root node (with node id 0) is split into its left node and right node, the left side is processed until no further splits are possible before returning and labeling the right node. In both the trees in your example, the RightNode gets a value of 2. This is because in both cases, the LeftNode turns out to be a leaf node.

这篇关于了解R gbm包中的树结构的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆