Retrieve Decision Boundary Lines (x,y coordinate format) from SKlearn Decision Tree


Problem Description


I am trying to create a surface plot on an external visualization platform. I'm working with the iris data set that is featured on the sklearn decision tree documentation page, and I'm using the same approach to create my decision surface plot. My end goal, though, is not the matplotlib visual, so from there I input the data into my visualization software. To do this I just called flatten() and tolist() on xx, yy and Z and wrote a JSON file containing these lists.

The trouble is that when I try to plot it, my visualization program crashes: the data is too large. When flattened, each list is more than 86,000 elements long. This is because the step size/plot step is very small (0.02), so the code essentially takes baby steps across the domain between the data's min and max, plotting/filling as it goes according to the model's predictions. It's kind of like a pixel grid; when I shrank it down to an array of only 2,000 points, I noticed that the coordinates were just lines going back and forth (eventually covering the entire coordinate plane).
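
For reference, a minimal sketch of what I'm doing, assuming (as in one panel of the sklearn example) a tree fit on the two petal features with a plot step of 0.02:

# Minimal sketch of the grid approach described above; the choice of the two
# petal features and the 0.02 step size are assumptions taken from the
# sklearn iris decision-tree example.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, [2, 3]], iris.target
clf = DecisionTreeClassifier().fit(X, y)

plot_step = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

print(xx.size)  # ~87,000 grid points -- this is what makes the JSON export so large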

Question: Can I retrieve the x,y coordinates of the decision boundary lines themselves (as opposed to iterating across the whole plane)? Ideally, I'd like a list containing only the turning points of each line. Alternatively, is there some completely different way to recreate this plot so that it is more computationally efficient?

This can somewhat be visualized by replacing the contourf() call with contour():
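
Roughly, that substitution looks like this (reusing xx, yy, Z, X and y from the sketch above):

# Same grid as before, but drawing only the class boundaries as lines
# rather than filled regions.
import matplotlib.pyplot as plt

plt.contour(xx, yy, Z, colors='k')
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')
plt.show()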

I'm just not sure how to retrieve the data governing those lines (via xx, yy and Z or possibly other means?).

Note: I'm not picky about the exact format of the list or data structure that contains the lines, as long as it's computationally efficient. For instance, in the first plot above some red areas are actually islands in the prediction space, which might mean each of them has to be handled as its own line. I'm guessing that as long as the class is coupled with the x,y coordinates, it shouldn't matter how many arrays (containing coordinates) are used to capture the decision boundaries.

Solution

Decision trees do not have very nice boundaries. They have multiple boundaries that hierarchically split the feature space into rectangular regions.

In my implementation of Node Harvest I wrote functions that parse scikit-learn's decision trees and extract the decision regions. For this answer I modified parts of that code to return a list of rectangles that correspond to a tree's decision regions. It should be easy to draw these rectangles with any plotting library. Here is an example using matplotlib:

n = 100
np.random.seed(42)
x = np.concatenate([np.random.randn(n, 2) + 1, np.random.randn(n, 2) - 1])
y = ['b'] * n + ['r'] * n
plt.scatter(x[:, 0], x[:, 1], c=y)

dtc = DecisionTreeClassifier().fit(x, y)
rectangles = decision_areas(dtc, [-3, 3, -3, 3])
plot_areas(rectangles)
plt.xlim(-3, 3)
plt.ylim(-3, 3)

Wherever regions of different color meet there is a decision boundary. I imagine it would be possible with moderate effort to extract just these boundary lines but I'll leave that to anyone who is interested.

rectangles is a numpy array. Each row corresponds to one rectangle and the columns are [xmin, xmax, ymin, ymax, class] (the left, right, bottom and top edges, plus the predicted class).
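
For anyone who does want just the boundary lines, here is a rough sketch of one way to get them from that array. The boundary_segments helper is hypothetical; it assumes the rectangles tile the plane without overlap (i.e. the tree was fit on exactly the two plotted features) and simply pairs up touching rectangles of different classes:

def boundary_segments(rectangles, tol=1e-9):
    """Hypothetical helper: collect the line segments where rectangles of
    different classes touch. Rows are [xmin, xmax, ymin, ymax, class] as
    returned by decision_areas(); returns ((x0, y0), (x1, y1)) tuples."""
    segments = []
    for a in rectangles:
        for b in rectangles:
            if a[4] == b[4]:
                continue  # same class, no decision boundary between them
            # a's right edge coincides with b's left edge
            if abs(a[1] - b[0]) < tol:
                lo, hi = max(a[2], b[2]), min(a[3], b[3])
                if hi > lo:
                    segments.append(((a[1], lo), (a[1], hi)))
            # a's top edge coincides with b's bottom edge
            if abs(a[3] - b[2]) < tol:
                lo, hi = max(a[0], b[0]), min(a[1], b[1])
                if hi > lo:
                    segments.append(((lo, a[3]), (hi, a[3])))
    return segments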


Update: Application to the Iris data set

The Iris data set contains three classes instead of the two in the example above, so we have to add another color to the plot_areas function: color = ['b', 'r', 'g'][int(rect[4])]. Furthermore, the data set is 4-dimensional (it contains four features), but we can only plot two features in 2D. We need to choose which features to plot and tell the decision_areas function. The function takes two arguments, x and y - these are the features that go on the x and y axes, respectively. The default is x=0, y=1, which works with any data set that has more than one feature. However, in the Iris data set the first dimension is not very interesting, so we will use a different setting.
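
With that change applied, plot_areas looks like this (otherwise identical to the version in the implementation at the end of this answer):

def plot_areas(rectangles):
    for rect in rectangles:
        color = ['b', 'r', 'g'][int(rect[4])]  # one color per iris class
        rp = Rectangle([rect[0], rect[2]],
                       rect[1] - rect[0],
                       rect[3] - rect[2], color=color, alpha=0.3)
        plt.gca().add_artist(rp)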

The function decision_areas also does not know about the extent of the data set. Often the decision tree has open decision ranges that extend toward infinity (e.g. whenever sepal length is less than xyz it's class B). In this case we need to artificially narrow down the range for plotting. I chose -3..3 for the example data set, but for the iris data set other ranges are appropriate (values are never negative, and some features extend beyond 3).

Here we plot the decision regions of the last two features over a range of 0..7 and 0..5:

from sklearn.datasets import load_iris
data = load_iris()
x = data.data
y = data.target
dtc = DecisionTreeClassifier().fit(x, y)
rectangles = decision_areas(dtc, [0, 7, 0, 5], x=2, y=3)
plt.scatter(x[:, 2], x[:, 3], c=y)
plot_areas(rectangles)

Note how there is a weird overlap of the red and green areas in the top left. This happens because the tree makes decisions in four dimensions but we can show only two. There is not really a clean way around this. A high dimensional classifier often has no nice decision boundaries in low-dimensional space.

So if you are more interested in the classifier, that is what you get. You can generate different views along various combinations of dimensions, but there are limits to the usefulness of such a representation.
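
A rough sketch of how such views could be generated (this assumes the three-color plot_areas from above and that 0..8 is a wide enough clipping range for every Iris feature; both choices are assumptions, not part of the original answer):

import itertools

# One 2D view of the 4D classifier per pair of features.
for xi, yi in itertools.combinations(range(4), 2):
    plt.figure()
    rectangles = decision_areas(dtc, [0, 8, 0, 8], x=xi, y=yi)
    plt.scatter(x[:, xi], x[:, yi], c=y)
    plot_areas(rectangles)
    plt.title('features %d vs %d' % (xi, yi))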

However, if you are more interested in the data than in the classifier you can restrict the dimensionality before fitting. In that case the classifier only makes decisions in the 2-dimensional space and we can plot nice decision regions:

from sklearn.datasets import load_iris
data = load_iris()
x = data.data[:, [2, 3]]
y = data.target
dtc = DecisionTreeClassifier().fit(x, y)
rectangles = decision_areas(dtc, [0, 7, 0, 3], x=0, y=1)
plt.scatter(x[:, 0], x[:, 1], c=y)
plot_areas(rectangles)
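
Since the original goal was to feed a compact description into an external visualization tool, these rectangles can be serialized directly instead of the dense xx/yy/Z grid. A hedged sketch (the rectangles_to_json helper and the JSON layout are made up, not part of any library):

import json

def rectangles_to_json(rectangles, path='decision_regions.json'):
    """Hypothetical export: one polygon (four turning points) per region."""
    regions = []
    for xmin, xmax, ymin, ymax, cls in rectangles:
        regions.append({
            'class': int(cls),
            'points': [[float(xmin), float(ymin)], [float(xmax), float(ymin)],
                       [float(xmax), float(ymax)], [float(xmin), float(ymax)]],
        })
    with open(path, 'w') as f:
        json.dump(regions, f)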


Finally, here is the implementation:

import numpy as np
from collections import deque
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import _tree as ctree
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle


class AABB:
    """Axis-aligned bounding box"""
    def __init__(self, n_features):
        self.limits = np.array([[-np.inf, np.inf]] * n_features)

    def split(self, f, v):
        left = AABB(self.limits.shape[0])
        right = AABB(self.limits.shape[0])
        left.limits = self.limits.copy()
        right.limits = self.limits.copy()

        left.limits[f, 1] = v
        right.limits[f, 0] = v

        return left, right


def tree_bounds(tree, n_features=None):
    """Compute final decision rule for each node in tree"""
    if n_features is None:
        n_features = np.max(tree.feature) + 1
    aabbs = [AABB(n_features) for _ in range(tree.node_count)]
    queue = deque([0])
    while queue:
        i = queue.pop()
        l = tree.children_left[i]
        r = tree.children_right[i]
        if l != ctree.TREE_LEAF:
            aabbs[l], aabbs[r] = aabbs[i].split(tree.feature[i], tree.threshold[i])
            queue.extend([l, r])
    return aabbs


def decision_areas(tree_classifier, maxrange, x=0, y=1, n_features=None):
    """ Extract decision areas.

    tree_classifier: Instance of a sklearn.tree.DecisionTreeClassifier
    maxrange: values to insert for [xmin, xmax, ymin, ymax] if the interval is open (+/-inf)
    x: index of the feature that goes on the x axis
    y: index of the feature that goes on the y axis
    n_features: override autodetection of number of features
    """
    tree = tree_classifier.tree_
    aabbs = tree_bounds(tree, n_features)

    rectangles = []
    for i in range(len(aabbs)):
        if tree.children_left[i] != ctree.TREE_LEAF:
            continue
        l = aabbs[i].limits
        r = [l[x, 0], l[x, 1], l[y, 0], l[y, 1], np.argmax(tree.value[i])]
        rectangles.append(r)
    rectangles = np.array(rectangles)
    rectangles[:, [0, 2]] = np.maximum(rectangles[:, [0, 2]], maxrange[0::2])
    rectangles[:, [1, 3]] = np.minimum(rectangles[:, [1, 3]], maxrange[1::2])
    return rectangles

def plot_areas(rectangles):
    for rect in rectangles:
        color = ['b', 'r'][int(rect[4])]  # one color per class (extend for more classes)
        rp = Rectangle([rect[0], rect[2]], 
                       rect[1] - rect[0], 
                       rect[3] - rect[2], color=color, alpha=0.3)
        plt.gca().add_artist(rp)
