Find Distance to Decision Boundary in Decision Trees


Question

I want to find the distance of samples to the decision boundary of a trained decision tree classifier in scikit-learn. The features are all numeric and the feature space could be of any size.

I have this visualization so far for an example 2D case:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons

# Generate some example data
X, y = make_moons(noise=0.3, random_state=0)

# Train the classifier
clf = DecisionTreeClassifier(max_depth=2)

clf.fit(X, y)

# Plot
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k')
plt.xlabel('a'); plt.ylabel('b');

I understand that for some other classifiers like SVM, this distance can be mathematically calculated [1, 2, 3]. The rules learned after training a decision tree define the boundaries and may also be helpful for algorithmically calculating the distances [4, 5, 6]:

# Plot the trained tree
from sklearn import tree
import graphviz
dot_data = tree.export_graphviz(clf, feature_names=['a', 'b'], class_names=['1', '2'], filled=True)
graph = graphviz.Source(dot_data)
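
As a side note, the learned rules can also be dumped as plain text with scikit-learn's export_text (a small addition for illustration, not part of the original question):

from sklearn.tree import export_text

# Print the learned if/else rules of the fitted classifier as text
print(export_text(clf, feature_names=['a', 'b']))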

Answer

Since there can be multiple decision boundaries around a sample, I'm going to assume distance here refers to the distance to the nearest decision boundary.

The solution is a recursive tree traversal algorithm. Note that a decision tree, unlike e.g. an SVM, doesn't allow a sample to lie on a boundary: every sample in feature space must belong to one of the classes. So we will keep modifying the sample's features in small steps, and whenever that leads to a region with a different label (than the one originally assigned to the sample by the trained classifier), we assume we've reached the decision boundary.

In detail, like any recursive algorithm, there are two main cases to consider:

  1. The base case: we are at a leaf node. We simply check whether the current sample has a different label: if so, return it, otherwise return None.
  2. A non-leaf node: there are two branches, and we send the sample down both. We don't modify the sample before sending it down the branch it would naturally take. But before sending it down the other branch, we look at the node's (feature, threshold) pair and modify the sample's given feature just enough to push it to the other side of the threshold (see the sketch after this list).
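
Before the full code, here is a minimal sketch of the scikit-learn internals the traversal relies on (the loop below is only illustrative, not from the original answer): clf.tree_ exposes parallel arrays indexed by node id, and a node is a leaf exactly when its left and right child indices coincide (both are -1):

t = clf.tree_  # the fitted classifier from the question above

for node in range(t.node_count):
    if t.children_left[node] == t.children_right[node]:  # both -1: a leaf
        print(f"node {node}: leaf")
    else:
        print(f"node {node}: go left if feature {t.feature[node]} "
              f"<= {t.threshold[node]:.3f}, else right")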

Complete Python code:

def f(node, x, orig_label):
    """Recursively collect modified copies of x that land in leaves
    whose predicted label differs from orig_label."""
    global dt, tree
    if tree.children_left[node] == tree.children_right[node]:  # node is a leaf
        return [x] if dt.predict([x])[0] != orig_label else [None]

    if x[tree.feature[node]] <= tree.threshold[node]:
        # x naturally goes left; also send a nudged copy down the right branch
        orig = f(tree.children_left[node], x, orig_label)
        xc = x.copy()
        xc[tree.feature[node]] = tree.threshold[node] + .01  # just past the threshold
        modif = f(tree.children_right[node], xc, orig_label)
    else:
        # x naturally goes right; also send a copy down the left branch
        orig = f(tree.children_right[node], x, orig_label)
        xc = x.copy()
        xc[tree.feature[node]] = tree.threshold[node]  # exactly at the threshold goes left
        modif = f(tree.children_left[node], xc, orig_label)
    return [s for s in orig + modif if s is not None]

This returns a list of samples that lead to leaves with a different label. All we need to do now is take the nearest one:

dt = DecisionTreeClassifier(max_depth=2).fit(X, y)
tree = dt.tree_
res = f(0, x, dt.predict([x])[0])  # 0 is the index of the root node
ans = np.min([np.linalg.norm(x - n) for n in res])
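
For instance, a minimal end-to-end run might look as follows (the choice x = X[0] is only an illustrative assumption, not from the original answer):

x = X[0]                           # pick an arbitrary sample (illustrative choice)
orig_label = dt.predict([x])[0]
res = f(0, x, orig_label)          # candidate points across the boundary
dists = [np.linalg.norm(x - n) for n in res]
print("distance to nearest decision boundary:", np.min(dists))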

For example:

Blue is the original sample; yellow is the nearest sample "on" the decision boundary.
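
A sketch for reproducing such a figure, reusing xx, yy, Z from the visualization above (the styling is an assumption, not the original answer's exact code):

nearest = res[np.argmin([np.linalg.norm(x - n) for n in res])]

plt.contourf(xx, yy, Z, alpha=0.4)                       # decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k')  # training data
plt.scatter(x[0], x[1], c='blue', s=80, label='original sample')
plt.scatter(nearest[0], nearest[1], c='yellow', s=80, edgecolor='k',
            label='nearest boundary sample')
plt.legend()
plt.show()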

