Interpreting Graphviz output for decision tree regression
Question
I'm curious what the value field is in the nodes of the decision tree produced by Graphviz when used for regression. I understand that for classification this is the number of samples of each class that end up on each side of a split, but I'm not sure what it means for regression.
My data has a 2 dimensional input and a 10 dimensional output. Here is an example of what a tree looks like for my regression problem:
produced using this code & visualized with webgraphviz
# X = (n x 2), Y = (n x 10), X_test = (m x 2)
import pickle
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor

input_scaler = pickle.load(open("../input_scaler.sav", "rb"))
reg = DecisionTreeRegressor(criterion='mse', max_depth=2)
reg.fit(X, Y)
pred = reg.predict(X_test)
with open("classifier.txt", "w") as f:
    tree.export_graphviz(reg, out_file=f)
Thanks!
What a regression tree actually returns as output is the mean value of the dependent variable (here Y) of the training samples that end up in the respective terminal nodes (leaves); these mean values are shown as lists named value in the picture, which are all of length 10 here, since your Y is 10-dimensional.
In other words, using the leftmost terminal node (leaf) of your tree as an example:

- The leaf consists of the 42 samples for which X[0] <= 0.675 and X[1] <= 0.5
- The mean value of your 10-dimensional output for these 42 samples is given in the value list of this leaf, which is indeed of length 10: the mean of Y[0] is -152007.382, the mean of Y[1] is -206040.675, etc., and the mean of Y[9] is 3211.487.
You can confirm that this is the case by predicting some samples (from your training or test set, it doesn't matter) and checking that your 10-dimensional result is one of the 4 value lists depicted in the terminal leaves above.
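The check described above can be sketched in code. Since the original X and Y are not shown, the data below is synthetic, with the same shapes as in the question (2-D input, 10-D output); the point is only that a depth-2 regression tree can emit at most 4 distinct output vectors, each equal to the mean Y of one leaf:

```python
# Sketch: a depth-2 regression tree's predictions are leaf means.
# X and Y here are synthetic stand-ins for the question's data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 2))          # n x 2 input
Y = rng.normal(size=(200, 10))    # n x 10 output

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, Y)

# With max_depth=2 there are at most 4 leaves, so every prediction
# must be one of at most 4 distinct 10-dimensional value lists.
pred = reg.predict(X)
unique_rows = np.unique(pred, axis=0)
print(len(unique_rows))  # at most 4

# Each predicted vector equals the mean Y of the training samples
# routed to the same leaf.
leaf_ids = reg.apply(X)
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    assert np.allclose(pred[mask][0], Y[mask].mean(axis=0))
```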
Additionally, you can confirm that, for each element in value, the weighted average of the children nodes equals the respective element of the parent node. Again, using the first element of your 2 leftmost terminal nodes (leaves), we get:
(-42*152007.382 - 56*199028.147)/98
# -178876.39057142858
i.e. the value[0] element of their parent node (the leftmost node in the intermediate level). One more example, this time for the first value elements of your 2 intermediate nodes:
(-98*178876.391 + 42*417378.245)/140
# -0.00020000000617333822
which again agrees with the -0.0 first value element of your root node.
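The two weighted-average checks above can be spelled out directly, using the sample counts (42, 56, 98, 42, 140) and the value[0] entries read off the picture:

```python
# value[0] of the two leftmost leaves, weighted by their sample counts,
# reproduces the value[0] of their parent node:
parent = (42 * -152007.382 + 56 * -199028.147) / 98
print(parent)   # -178876.39057142858

# value[0] of the two intermediate nodes (using the rounded -178876.391
# shown in the picture), weighted by 98 and 42 samples, reproduces the
# root's value[0] up to display rounding:
root = (98 * -178876.391 + 42 * 417378.245) / 140
print(root)     # ~ -0.0002, consistent with the root's displayed -0.0
```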
Judging from the value list of your root node, it seems that the mean values of all elements of your 10-dimensional Y are almost zero, which you can (and should) verify manually, as a final confirmation.
So, to wrap up:

- The value list of each node contains the mean Y values of the training samples "belonging" to the respective node
- Additionally, for the terminal nodes (leaves), these lists are the actual outputs of the tree model (i.e. the output will always be one of these lists, depending on X)
- For the root node, the value list contains the mean Y values of the whole training dataset
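These per-node value lists can also be read directly from the fitted estimator instead of the Graphviz picture. A minimal sketch, again with synthetic data since the original X/Y are not shown:

```python
# Sketch: read per-node value lists from the fitted tree itself.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 2))
Y = rng.normal(size=(200, 10))

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, Y)

# For regression, tree_.value has shape (n_nodes, n_outputs, 1);
# node 0 is the root, so its value list is the mean of the whole Y.
root_value = reg.tree_.value[0].ravel()
assert np.allclose(root_value, Y.mean(axis=0))
```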