Decision Tree generating terminal leaves with same classes


Question

I'm relatively new to decision trees and I'm stuck with my decision tree algorithm. I'm using cross-validation and parameter tuning to optimize the classification, following this example: https://medium.com/@haydar_ai/learning-data-science-day-22-cross-validation-and-parameter-tuning-b14bcbc6b012. But however I tune my parameters, I always get results that look like this (here just an example for a small tree):

[image: example of a small decision tree]

I don't understand the reason for this behaviour. Why does the tree generate leaves with the same class (here class2)? Why does it not simply stop after a <= 0.375 = TRUE and cut off the leaves with the same class (see the red rectangle in the picture)? Is there a way to prevent this and make the algorithm stop at this point? Or is there a reasonable explanation for this behaviour? Any help or ideas would be highly appreciated! Thanks!

Here is my code:

    # Imports needed for this snippet
    from csv import reader
    from sklearn import tree
    from sklearn.tree import DecisionTreeClassifier
    import graphviz

    # Load a CSV file into a list of rows
    def load_csv(filename):
        dataset = list()
        with open(filename, 'r') as file:
            csv_reader = reader(file)
            for row in csv_reader:
                if not row:
                    continue
                dataset.append(row)
        return dataset

    # Convert string column to float
    def str_column_to_float(dataset, column):
        for row in dataset:
            row[column] = float(row[column].strip())


    # Load dataset
    filename = 'C:/Test.csv'
    dataset = load_csv(filename)


    # convert string columns to float
    for i in range(len(dataset[0])):
        str_column_to_float(dataset, i)

    # Transform to x and y: features are all but the last column, the label is the last
    x = []
    xpart = []
    y = []
    for row in dataset:
        for i in range(len(row)):
            if i != (len(row) - 1):
                xpart.append(row[i])
            else:
                y.append(row[i])
        x.append(xpart)
        xpart = []

    features_names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
    labels = ['class1', 'class2']

    #here I tried to tune the parameters
    #(I changed them several times; this is just an example to show what the code looks like).
    #However, I always ended up with terminal leaves with the same classes
    """dtree=DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
        max_features=8, max_leaf_nodes=None, min_impurity_decrease = 0.0, min_impurity_split = None, min_samples_leaf=1,
        min_samples_split=2, min_weight_fraction_leaf=0.0,
        presort=False, random_state=None, splitter='random')"""

    #here, I created the small example
    dtree = DecisionTreeClassifier(max_depth=2)
    dtree.fit(x,y)

    dot_data = tree.export_graphviz(dtree, out_file=None) 
    graph = graphviz.Source(dot_data) 
    graph.render("Result") 

    dot_data = tree.export_graphviz(dtree, out_file=None, 
                     feature_names= features_names,  
                     class_names=labels,  
                     filled=True, rounded=True,  
                     special_characters=True)  
    graph = graphviz.Source(dot_data)  
    graph.format = 'png'
    graph.render('Result', view = True)

... and a snapshot of my data:

[image: snapshot of the data]

Answer

The class attribute you are referring to is the majority class at that particular node, and the colours come from the filled=True parameter you pass to export_graphviz().

Now, looking at your dataset, you have 147 samples of class1 and 525 samples of class2, which is a fairly imbalanced ratio. It just so happens that the optimal splits for your particular dataset at this depth produce splits where the majority class is class2. This is normal behaviour and a product of your data, and it is not altogether surprising given that class2 outnumbers class1 by about 3:1.
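If you want to see that imbalance for yourself before fitting, a minimal sketch along these lines works. It assumes the x and y built in your code; class_weight='balanced' is a standard DecisionTreeClassifier option, not something from your original snippet:

    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier

    # Show the per-class sample counts before training (147 vs. 525 in your data)
    print(Counter(y))

    # Optionally re-weight samples inversely to class frequency, so the
    # minority class is not drowned out by the roughly 3:1 imbalance
    dtree_balanced = DecisionTreeClassifier(max_depth=2, class_weight='balanced')
    dtree_balanced.fit(x, y)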

As to why the tree doesn't stop when the majority class is the same for the two children of a split, that's just how the algorithm works. If left unbounded with no max depth, it will continue until it produces only pure leaf nodes that contain a single class exclusively (and where the Gini impurity is 0). You've set max_depth=2 in your example, so the tree simply stops before it can yield all pure nodes.
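To convince yourself of this, here is a minimal sketch (again assuming the x and y from your code) that grows the tree with no depth limit and reads the leaf impurities from scikit-learn's low-level tree_ attribute:

    from sklearn.tree import DecisionTreeClassifier

    # No max_depth: the tree grows until every leaf is pure
    full_tree = DecisionTreeClassifier().fit(x, y)

    t = full_tree.tree_
    leaf_mask = t.children_left == -1   # leaves have no children (-1 = TREE_LEAF)
    # All zeros, unless identical feature rows carry different labels
    print("leaf Gini impurities:", t.impurity[leaf_mask])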

You'll notice that in the split you've boxed in red in your example, the node on the right is almost 100% class2, with 54 instances of class2 and only 2 of class1. If the algorithm had stopped before that, it would have produced the node above it, with a 291-45 class2-class1 split, which is far less useful.
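Those per-node counts are exactly what export_graphviz prints as value = [...], and you can read them programmatically as well. A sketch, assuming the fitted dtree from your code; note that older scikit-learn versions store raw counts in tree_.value while recent ones store class fractions, but the majority class comes out the same either way:

    import numpy as np

    t = dtree.tree_
    for node in range(t.node_count):
        counts = t.value[node][0]                    # per-class tally at this node
        majority = dtree.classes_[np.argmax(counts)] # the class shown in the plot
        print(f"node {node}: value={counts}, majority={majority}")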

Perhaps you could increase the max depth of your tree and see if you can separate the classes out further.
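One way to do that is to score a few candidate depths with cross-validation rather than eyeballing a single tree; the depth grid below is illustrative, not taken from your post:

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Try deeper trees and let 5-fold cross-validation compare them
    for depth in [2, 3, 4, 5, 6, None]:   # None = grow until pure leaves
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
        scores = cross_val_score(clf, x, y, cv=5)
        print(f"max_depth={depth}: mean accuracy {scores.mean():.3f}")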

