Why does the CART algorithm of MATLAB's fitctree take the attribute order into account?

Question

Here is an example showing that MATLAB's fitctree takes the order of the features into account. Why?

load ionosphere % Contains X and Y variables
Mdl = fitctree(X,Y)
view(Mdl,'mode','graph');
X1=fliplr(X);
Mdl1 = fitctree(X1,Y)
view(Mdl1,'mode','graph');

The two models are not the same, so the classification accuracy is not the same either, even though both are built from the same features. Why?

Recommended answer

In your example, X contains 34 predictors. The predictors have no names, so fitctree just refers to them by their column numbers x1, x2, ..., x34. If you flip the table, the column numbers change and so do the names: x1 -> x34, x2 -> x33, etc.

For most nodes this does not matter, because CART always splits a node on the predictor that maximises the impurity gain of the split into two child nodes. But sometimes several predictors result in exactly the same impurity gain. In that case it appears to pick the one with the lowest column number. And since the column numbers change when the predictors are reordered, you end up with a different predictor at that node.
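The effect of such a tie can be reproduced with a toy node in plain MATLAB. This is only a sketch and not the internal code of fitctree: the data, the fixed threshold of 0.5 and the helper names gini and splitGain are made up for illustration, and a real CART implementation would also search over thresholds. It relies on the fact that max returns the index of the first maximum when several entries are equal.

```matlab
% Toy node: 8 observations, two classes (1 and 2), two hypothetical predictors.
y = [1; 1; 1; 1; 2; 2; 2; 2];
A = [0.1; 0.2; 0.3; 0.9; 0.4; 0.8; 0.7; 0.6];   % A < 0.5 for observations 1, 2, 3, 5
B = [0.2; 0.1; 0.8; 0.3; 0.9; 0.4; 0.6; 0.7];   % B < 0.5 for observations 1, 2, 4, 6
Xtoy = [A B];
n = numel(y);

% Gini impurity of a set of labels (labels must be a column of positive integers).
gini = @(labels) 1 - sum((accumarray(labels, 1) / numel(labels)).^2);

% Impurity gain of splitting the node at threshold t on predictor x.
splitGain = @(x, t) gini(y) ...
    - (nnz(x <  t) / n) * gini(y(x <  t)) ...
    - (nnz(x >= t) / n) * gini(y(x >= t));

% Both splits put 3 + 1 labels in one child and 1 + 3 in the other,
% so both gains are equal (0.125).
gains = arrayfun(@(j) splitGain(Xtoy(:, j), 0.5), 1:2);
[~, best] = max(gains);                 % ties go to the first column: best = 1

% Same data with the columns flipped: the tie is now resolved the other way.
Xflip = fliplr(Xtoy);
gainsFlip = arrayfun(@(j) splitGain(Xflip(:, j), 0.5), 1:2);
[~, bestFlip] = max(gainsFlip);         % again index 1 of the flipped matrix
bestFlipOriginal = size(Xtoy, 2) + 1 - bestFlip;   % = 2, i.e. predictor B

fprintf('original order picks column %d, flipped order picks original column %d\n', ...
    best, bestFlipOriginal);
```

The two splits partition the observations differently, but the class counts in the children are identical, which is all the impurity gain depends on.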

For example, let's look at the marked split:

Original order (mdl): Flipped order (mdl1):

Up to this point the same predictors and cut values have been chosen in both trees. Only the names differ because of the reordering, e.g. x5 in the original data is x30 in the flipped model. But x3 and x6 are actually different predictors: x6 in the flipped order is x29 in the original order.

A scatter plot between those predictors shows how this could happen:

The blue and cyan lines mark the splits performed by mdl and mdl1 respectively at that node. As we can see, both splits yield child nodes with the same number of elements per label! Therefore CART can choose either of the two predictors: both give the same impurity gain.

In that case it seems to just pick the one with the lower column number. In the non-flipped table x3 is chosen instead of x29 because 3 < 29. But if you flip the table, x3 becomes x32 and x29 becomes x6. Since 6 < 32, you now end up with x6, the original x29.
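To compare the two trees without translating column numbers by hand, the split predictors of the flipped model can be mapped back to the original numbering. The following is a sketch that assumes the default predictor names x1, ..., x34 that fitctree assigns to a numeric matrix; toOriginal and namesToIdx are helper names introduced here, not part of any MATLAB API.

```matlab
load ionosphere                 % Contains X and Y variables
p = size(X, 2);                 % number of predictors (34 here)

Mdl  = fitctree(X, Y);
Mdl1 = fitctree(fliplr(X), Y);

% fliplr maps original column j to column p + 1 - j, and back again.
toOriginal = @(j) p + 1 - j;

% CutPredictor holds one name per node ('x5', ...) and '' for leaf nodes.
% Drop the leaves and parse the column number out of each name.
namesToIdx = @(names) cellfun(@(s) sscanf(s, 'x%d'), ...
    names(~cellfun(@isempty, names)));

splitsOriginal = namesToIdx(Mdl.CutPredictor);
splitsFlipped  = toOriginal(namesToIdx(Mdl1.CutPredictor));

disp('Predictors used by the original tree (original numbering):')
disp(unique(splitsOriginal)')
disp('Predictors used by the flipped tree (original numbering):')
disp(unique(splitsFlipped)')
disp('Predictors used by only one of the two trees:')
disp(setxor(splitsOriginal, splitsFlipped)')
```

With this mapping, toOriginal(6) gives 29 and toOriginal(3) gives 32, matching the renaming described above.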

Ultimately this does not matter: the decision tree of the flipped table is neither better nor worse. It only happens in the lower nodes, where the tree starts to overfit. So you really don't have to worry about it.
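If you want to check this on your own data, one option is to compare the training error and a cross-validated error of both trees. This is a sketch under a few assumptions: the seed and the choice of 10 folds are arbitrary, and the same partition is reused for both models so that any difference comes from the column order only. The numbers may differ slightly between the two trees.

```matlab
load ionosphere                 % Contains X and Y variables
Mdl  = fitctree(X, Y);
Mdl1 = fitctree(fliplr(X), Y);

% Resubstitution (training) error of each tree
errTrain  = resubLoss(Mdl);
errTrain1 = resubLoss(Mdl1);

% 10-fold cross-validated error, using the same partition for both trees
rng(1);                                  % arbitrary seed, for reproducibility only
cvp = cvpartition(Y, 'KFold', 10);
errCV  = kfoldLoss(crossval(Mdl,  'CVPartition', cvp));
errCV1 = kfoldLoss(crossval(Mdl1, 'CVPartition', cvp));

fprintf('training error:   %.4f (original) vs %.4f (flipped)\n', errTrain, errTrain1);
fprintf('10-fold CV error: %.4f (original) vs %.4f (flipped)\n', errCV, errCV1);
```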

Appendix:

Code that generates the scatter plot:

load ionosphere % Contains X and Y variables
Mdl = fitctree(X,Y);
view(Mdl,'mode','graph');
X1=fliplr(X);
Mdl1 = fitctree(X1,Y);
view(Mdl1,'mode','graph');

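% Observations that reach the marked node (splits taken from the tree view)
idx = (X(:,5)>=0.23154 & X(:,27)>=0.999945 & X(:,1)>=0.5);
remainder = X(idx,:);
labels = cell2mat(Y(idx,:));

% Predictor 3 against predictor 29 (column 6 of the flipped table, 35-6 = 29)
gscatter(remainder(:,3), remainder(:,(35-6)), labels,'rgb','osd');

limits = [-1.5 1.5];
xlim(limits)
ylim(limits)
xlabel('predictor 3')
ylabel('predictor 29')
hold on
plot([0.73 0.73], limits, '-b')     % split chosen by Mdl  (on predictor 3)
plot(limits, [0.693 0.693], '-c')   % split chosen by Mdl1 (on predictor 29)
legend({'b' 'g'})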
