Why does the CART algorithm of MATLAB's 'fitctree' take the attribute order into account?


Problem description

Here is an example showing that MATLAB's fitctree takes the feature order into account. Why?

load ionosphere % Contains X and Y variables
Mdl = fitctree(X,Y)
view(Mdl,'mode','graph');
X1=fliplr(X);
Mdl1 = fitctree(X1,Y)
view(Mdl1,'mode','graph');

Not the same model, and thus not the same classification accuracy, despite dealing with the same features?

Answer

In your example, X contains 34 predictors. The predictors have no names, so fitctree simply refers to them by their column numbers x1, x2, ..., x34. If you flip the table, the column numbers change, and therefore so do the names: x1 -> x34, x2 -> x33, and so on.
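One way to make that correspondence explicit is to pass your own predictor names, so the same feature keeps the same name in both models. A minimal sketch (the names f1..f34 are made up for illustration):

load ionosphere                          % X (351x34) and Y
names = arrayfun(@(k) sprintf('f%d',k), 1:34, 'UniformOutput', false);

Mdl  = fitctree(X,        Y, 'PredictorNames', names);
Mdl1 = fitctree(fliplr(X), Y, 'PredictorNames', fliplr(names));

% Both trees now label their splits with the same feature names, so the
% only remaining differences are the tie-broken splits discussed below.
view(Mdl,  'mode', 'graph');
view(Mdl1, 'mode', 'graph');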

For most nodes this does not matter, because CART always splits a node on the predictor that maximises the impurity gain between the two child nodes. But sometimes several predictors produce exactly the same impurity gain. In that case fitctree simply picks the one with the lowest column number, and since the column numbers change when the predictors are reordered, you end up with a different predictor at that node.
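fitctree's default split criterion is the Gini diversity index. The helper below is a rough sketch (it is not fitctree's internal code) of the Gini gain of one candidate split, just to illustrate how two different predictors can tie exactly:

function gain = giniGain(x, labels, threshold)
% GINIGAIN  Gini impurity gain of splitting LABELS by X < THRESHOLD.
% (Illustrative sketch; assumes both child nodes are non-empty.)
gini   = @(L) 1 - sum((histcounts(categorical(L)) ./ numel(L)).^2);
isLeft = x < threshold;
n      = numel(labels);
gain   = gini(labels) ...
       - nnz(isLeft)/n  * gini(labels(isLeft)) ...
       - nnz(~isLeft)/n * gini(labels(~isLeft));
end

For the observations that reach a given node, two candidate splits that put the same number of each label into their child nodes yield exactly the same gain, and fitctree then falls back to the lower column number.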

For example, let's look at the marked split:

Original order (mdl) vs. flipped order (mdl1): [tree view screenshots]

Up to this point, the same predictors and split values have been chosen in both trees; only the names changed with the order, e.g. x5 in the old data = x30 in the new model. But x3 and x6 are actually different predictors: x6 in the flipped order is x29 in the original order.

A scatter plot between those two predictors shows how this could happen:

Here the blue and cyan lines mark the splits performed at that node by mdl and mdl1, respectively. As we can see, both splits yield child nodes with the same number of elements per label! Therefore CART can choose either of the two predictors; both cause the same impurity gain.

In that case it seems to just pick the one with the lower column number. In the non-flipped table, x3 is chosen instead of x29 because 3 < 29. But if you flip the table, x3 becomes x32 and x29 becomes x6. Since 6 < 32, you now end up with x6, which is the original x29.
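A quick way to convince yourself of that mapping: with 34 columns, column j of the flipped matrix is column 35-j of the original one.

load ionosphere
X1 = fliplr(X);
isequal(X1(:,6),  X(:,29))   % true: x6 in the flipped order is the original x29
isequal(X1(:,32), X(:,3))    % true: x32 in the flipped order is the original x3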

Ultimately this does not matter: the decision tree of the flipped table is neither better nor worse. It only happens in the lower nodes, where the tree starts to overfit, so you really don't have to care about it.

Appendix:

Code to generate the scatter plot:

load ionosphere % Contains X and Y variables
Mdl = fitctree(X,Y);
view(Mdl,'mode','graph');
X1=fliplr(X);
Mdl1 = fitctree(X1,Y);
view(Mdl1,'mode','graph');

idx = (X(:,5)>=0.23154 & X(:,27)>=0.999945 & X(:,1)>=0.5); % observations that reach the node of the marked split
remainder = X(idx,:);
labels = cell2mat(Y(idx,:));

gscatter(remainder(:,3), remainder(:,(35-6)), labels,'rgb','osd'); % 35-6 = column 29, i.e. x6 in the flipped order

limits = [-1.5 1.5];
xlim(limits)
ylim(limits)
xlabel('predictor 3')
ylabel('predictor 29')
hold on
plot([0.73 0.73], limits, '-b')
plot(limits, [0.693 0.693], '-c')
legend({'b' 'g'})
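As a rough check of the tie itself, one can count the labels in the child nodes produced by either candidate split on the same subset; identical counts imply identical Gini gain. The thresholds are read off the plotted split lines, so they are only approximate, and the two count tables may appear with their rows swapped:

% Child-node label counts for the split on the original predictor 3 ...
splitA = remainder(:,3)  >= 0.73;
% ... and for the split on the original predictor 29 (x6 in the flipped order)
splitB = remainder(:,29) >= 0.693;

countsA = [sum(labels(splitA)=='b')  sum(labels(splitA)=='g'); ...
           sum(labels(~splitA)=='b') sum(labels(~splitA)=='g')]
countsB = [sum(labels(splitB)=='b')  sum(labels(splitB)=='g'); ...
           sum(labels(~splitB)=='b') sum(labels(~splitB)=='g')]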

