Interpreting Graphviz output for decision tree regression


Question


I'm curious what the value field in the nodes of a decision tree exported via Graphviz means when the tree is used for regression. I understand that for classification it is the number of training samples of each class that end up on each side of a split, but I'm not sure what it means for regression.

My data has a 2-dimensional input and a 10-dimensional output. Here is an example of what a tree looks like for my regression problem:

Produced using this code and visualized with webgraphviz:

import pickle
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor

# X = (n x 2)  Y = (n x 10)  X_test = (m x 2)

input_scaler = pickle.load(open("../input_scaler.sav", "rb"))  # loaded but not used below
reg = DecisionTreeRegressor(criterion='mse', max_depth=2)  # 'mse' was renamed 'squared_error' in newer scikit-learn
reg.fit(X, Y)
pred = reg.predict(X_test)
with open("classifier.txt", "w") as f:
    f = tree.export_graphviz(reg, out_file=f)  # writes the Graphviz DOT source to classifier.txt

Thanks!

Solution

What a regression tree actually returns as output is the mean value of the dependent variable (here Y) of the training samples that end up in the respective terminal nodes (leaves); these mean values are shown as lists named value in the picture, which are all of length 10 here, since your Y is 10-dimensional.
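You can check this claim programmatically; here is a minimal sketch, assuming reg, X, and Y are the fitted regressor and the training arrays (as NumPy arrays) from the question. It groups the training samples by the leaf they land in and compares each group's mean to the stored value list:

import numpy as np

leaf_ids = reg.apply(X)                             # index of the leaf each training sample falls into
for leaf in np.unique(leaf_ids):
    manual_mean = Y[leaf_ids == leaf].mean(axis=0)  # mean 10-dim Y over the samples in this leaf
    stored = reg.tree_.value[leaf, :, 0]            # the 'value' list Graphviz displays for this node
    assert np.allclose(manual_mean, stored)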

In other words, and using the leftmost terminal node (leaf) of your tree as an example:

  • The leaf consists of the 42 samples for which X[0] <= 0.675 and X[1] <= 0.5
  • The mean value of your 10-dimensional output for these 42 samples is given in the value list of this leaf, which is indeed of length 10, i.e. the mean of Y[0] is -152007.382, the mean of Y[1] is -206040.675, etc., and the mean of Y[9] is 3211.487.

You can confirm that this is the case by predicting some samples (from your training or test set, it doesn't matter) and checking that your 10-dimensional result is one of the 4 value lists shown in the terminal nodes (leaves) above.
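For instance, continuing the sketch above (and again assuming the reg and X_test from the question), every prediction should coincide exactly with the value list of the leaf the sample reaches:

leaf_values = reg.tree_.value[reg.apply(X_test), :, 0]  # value list of the leaf each test sample reaches
assert np.allclose(reg.predict(X_test), leaf_values)    # predictions are exactly these leaf means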

Additionally, you can confirm that, for each element of value, the weighted average over the children nodes equals the respective element of the parent node. Again, using the first elements of your 2 leftmost terminal nodes (leaves), we get:

(-42*152007.382 - 56*199028.147)/98
# -178876.39057142858

i.e. the value[0] element of their parent node (the leftmost node in the intermediate level). One more example, this time for the first value elements of your 2 intermediate nodes:

(-98*178876.391 + 42*417378.245)/140
# -0.00020000000617333822

which again agrees with the -0.0 first value element of your root node.
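Rather than checking node by node, this consistency property can be verified over the whole tree at once; a sketch using the low-level tree_ attributes (children_left is -1 for leaf nodes), continuing from the snippets above:

t = reg.tree_
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:      # -1 marks a leaf; nothing to verify
        continue
    n_left, n_right = t.n_node_samples[left], t.n_node_samples[right]
    weighted = (n_left * t.value[left, :, 0] + n_right * t.value[right, :, 0]) / (n_left + n_right)
    assert np.allclose(weighted, t.value[node, :, 0])  # children's weighted mean equals the parent's value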

Judging from the value list of your root node, it seems that the mean values of all elements of your 10-dimensional Y are almost zero, which you can (and should) verify manually, as a final confirmation.
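That final verification is a one-liner, since the root's value list should simply be the column-wise mean of the training targets (again assuming Y is the training array from the question):

root_value = reg.tree_.value[0, :, 0]            # the root node's value list
assert np.allclose(root_value, Y.mean(axis=0))   # root value = mean Y over the whole training set
print(Y.mean(axis=0))                            # here, expected to be ~0 in every component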


So, to wrap up:

  • The value list of each node contains the mean Y values for the training samples "belonging" to the respective node
  • Additionally, for the terminal nodes (leaves), these lists are the actual outputs of the tree model (i.e. the output will always be one of these lists, depending on X)
  • For the root node, the value list contains the mean Y values for the whole of your training dataset
