Interpreting Graphviz output for decision tree regression

Question
I'm curious what the value field is in the nodes of the decision tree produced by Graphviz when used for regression. I understand that this is the number of samples in each class that are separated by a split when using decision tree classification, but I'm not sure what it means for regression.
My data has a 2-dimensional input and a 10-dimensional output. Here is an example of what a tree looks like for my regression problem:
Produced using this code & visualized with webgraphviz:
import pickle
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor

# X = (n x 2)   Y = (n x 10)   X_test = (m x 2)
input_scaler = pickle.load(open("../input_scaler.sav", "rb"))
# 'mse' was renamed 'squared_error' in scikit-learn 1.2
reg = DecisionTreeRegressor(criterion='squared_error', max_depth=2)
reg.fit(X, Y)
pred = reg.predict(X_test)
with open("classifier.txt", "w") as f:
    tree.export_graphviz(reg, out_file=f)
Answer
What a regression tree actually returns as output is the mean value of the dependent variable (here Y) of the training samples that end up in the respective terminal nodes (leaves); these mean values are shown as lists named value in the picture, which are all of length 10 here, since your Y is 10-dimensional.
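A quick way to see this for yourself (a minimal sketch with synthetic data standing in for your X and Y): apply() reports which leaf each training sample lands in, and the value stored at that leaf is exactly the per-leaf mean of Y.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 2)            # 2-dimensional input, as in the question
Y = rng.rand(200, 10) * 100     # 10-dimensional output

reg = DecisionTreeRegressor(max_depth=2).fit(X, Y)

# apply() gives the leaf id each sample falls into; the leaf's `value`
# is simply the mean of Y over the training samples in that leaf
leaf_ids = reg.apply(X)
for leaf in np.unique(leaf_ids):
    mean_y = Y[leaf_ids == leaf].mean(axis=0)    # length-10 mean
    stored = reg.tree_.value[leaf].ravel()       # the `value` list shown by Graphviz
    assert np.allclose(mean_y, stored)
print("each leaf `value` equals the per-leaf mean of Y")
```

Note that tree_.value has shape (n_nodes, n_outputs, 1), so ravel() turns each node's entry into a flat length-10 list like the ones in the picture.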
In other words, and using the leftmost terminal node (leaf) of your tree as an example:
- The leaf consists of the 42 samples for which X[0] <= 0.675 and X[1] <= 0.5
- The mean value of your 10-dimensional output for these 42 samples is given in the value list of this leaf, which is indeed of length 10, i.e. the mean of Y[0] is -152007.382, the mean of Y[1] is -206040.675, etc., and the mean of Y[9] is 3211.487.
You can confirm that this is the case by predicting some samples (from your training or test set - it doesn't matter) and checking that your 10-dimensional result is one of the 4 value lists depicted in the terminal leaves above.
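That check can also be scripted (again a sketch on synthetic data in place of your X and Y): collect the value lists of the leaves, then confirm every prediction coincides with one of them.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X, Y = rng.rand(150, 2), rng.rand(150, 10)
reg = DecisionTreeRegressor(max_depth=2).fit(X, Y)

# leaves are the nodes without children (children_left == -1)
is_leaf = reg.tree_.children_left == -1
leaf_values = reg.tree_.value[is_leaf].reshape(-1, 10)  # at most 4 leaves at depth 2

# every prediction must coincide with exactly one of these lists
preds = reg.predict(rng.rand(20, 2))
for p in preds:
    assert any(np.allclose(p, v) for v in leaf_values)
print(f"all predictions are one of the {len(leaf_values)} leaf `value` lists")
```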
Additionally, you can confirm that, for each element in value, the weighted averages of the children nodes are equal to the respective element of the parent node. Again, using the first element of your 2 leftmost terminal nodes (leaves), we get:
(-42*152007.382 - 56*199028.147)/98
# -178876.39057142858
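This weighted-average identity can be checked programmatically for every split in a fitted tree (a sketch on synthetic data; tree_.n_node_samples holds the sample counts that Graphviz prints in each node):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X, Y = rng.rand(140, 2), rng.rand(140, 10)
reg = DecisionTreeRegressor(max_depth=2).fit(X, Y)

t = reg.tree_
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:       # leaf: no children to average
        continue
    n_l, n_r = t.n_node_samples[left], t.n_node_samples[right]
    # parent `value` is the sample-weighted average of its children's `value`
    weighted = (n_l * t.value[left] + n_r * t.value[right]) / (n_l + n_r)
    assert np.allclose(weighted, t.value[node])
print("every parent `value` is the weighted average of its children")
```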
i.e. the value[0] element of their parent node (the leftmost node in the intermediate level). One more example, this time for the first value elements of your 2 intermediate nodes:
(-98*178876.391 + 42*417378.245)/140
# -0.00020000000617333822
which again agrees with the -0.0 first value element of your root node.
Judging from the value list of your root node, it seems that the mean values of all elements of your 10-dimensional Y are almost zero, which you can (and should) verify manually, as a final confirmation.
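That final check is a one-liner via the tree_ attribute (node 0 is always the root; synthetic data below stands in for your actual Y):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X, Y = rng.rand(100, 2), rng.randn(100, 10)
reg = DecisionTreeRegressor(max_depth=2).fit(X, Y)

# the root's `value` list is just the column-wise mean of the training Y
assert np.allclose(reg.tree_.value[0].ravel(), Y.mean(axis=0))
print("root `value` == Y.mean(axis=0)")
```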
To summarize:
- The value list of each node contains the mean Y values for the training samples "belonging" to the respective node
- Additionally, for the terminal nodes (leaves), these lists are the actual outputs of the tree model (i.e. the output will always be one of these lists, depending on X)
- For the root node, the value list contains the mean Y values for the whole of your training dataset