Extract rule path of data point through decision tree with sklearn python


Problem description

I'm using a decision tree model and I want to extract the decision path for each data point in order to understand what caused the Y value rather than to predict it.
How can I do that? I couldn't find any documentation.

Solution

Here is an example using the iris dataset.

from sklearn.datasets import load_iris
from sklearn import tree
import graphviz 

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

dot_data = tree.export_graphviz(clf, out_file=None, 
                                feature_names=iris.feature_names,  
                                class_names=iris.target_names,  
                                filled=True, rounded=True,  
                                special_characters=True)  
graph = graphviz.Source(dot_data)  
# this will create an iris.pdf file showing the whole tree and its rule paths
graph.render("iris")
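
If you only need the rules as plain text rather than a rendered PDF, scikit-learn 0.21+ also ships tree.export_text, which prints every rule path of a fitted tree without requiring graphviz. A minimal sketch, reusing the clf fitted above:

from sklearn import tree

# plain-text dump of all rule paths in the fitted tree (no graphviz needed);
# each line is indented by depth, so any root-to-leaf path reads top-down
print(tree.export_text(clf, feature_names=iris.feature_names))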


EDIT: the following code is from the sklearn documentation with some small changes to address your goal

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

# The decision estimator has an attribute called tree_  which stores the entire
# tree structure and allows access to low level attributes. The binary tree
# tree_ is represented as a number of parallel arrays. The i-th element of each
# array holds information about the node `i`. Node 0 is the tree's root. NOTE:
# Some of the arrays only apply to either leaves or split nodes, resp. In this
# case the values of nodes of the other type are arbitrary!
#
# Among those arrays, we have:
#   - left_child, id of the left child of the node
#   - right_child, id of the right child of the node
#   - feature, feature used for splitting the node
#   - threshold, threshold value at the node

n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold

# The tree structure can be traversed to compute various properties such
# as the depth of each node and whether or not it is a leaf.
node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed is the root node id and its parent depth
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    # If we have a test node
    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has %s nodes and has "
      "the following tree structure:"
      % n_nodes)
for i in range(n_nodes):
    if is_leaves[i]:
        print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
    else:
        print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
              "node %s."
              % (node_depth[i] * "\t",
                 i,
                 children_left[i],
                 feature[i],
                 threshold[i],
                 children_right[i],
                 ))
print()

# First let's retrieve the decision path of each sample. The decision_path
# method allows retrieving the node indicator functions. A non-zero element of
# the indicator matrix at position (i, j) indicates that sample i goes
# through node j.

node_indicator = estimator.decision_path(X_test)

# Similarly, we can also retrieve the leaf ids reached by each sample.

leave_id = estimator.apply(X_test)

# Now, it's possible to get the tests that were used to predict a sample or
# a group of samples. First, let's do it for a single sample.

# HERE IS WHAT YOU WANT
sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                    node_indicator.indptr[sample_id + 1]]
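# node_indicator is a scipy CSR sparse matrix: slicing its indices array with
# indptr[sample_id]:indptr[sample_id + 1] yields the ids of every node that
# this sample passes through, from the root down to its leaf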

print('Rules used to predict sample %s: ' % sample_id)
for node_id in node_index:

    if leave_id[sample_id] == node_id:  # <-- changed != to ==
        #continue # <-- comment out
        print("leaf node {} reached, no decision here".format(leave_id[sample_id])) # <--

    else: # <-- added else to iterate through decision nodes
        if (X_test[sample_id, feature[node_id]] <= threshold[node_id]):
            threshold_sign = "<="
        else:
            threshold_sign = ">"

        print("decision id node %s : (X[%s, %s] (= %s) %s %s)"
              % (node_id,
                 sample_id,
                 feature[node_id],
                 X_test[sample_id, feature[node_id]], # <-- changed i to sample_id
                 threshold_sign,
                 threshold[node_id]))


This will print the following at the end:

Rules used to predict sample 0: 
decision id node 0 : (X[0, 3] (= 2.4) > 0.800000011920929)
decision id node 2 : (X[0, 2] (= 5.1) > 4.949999809265137)
leaf node 4 reached, no decision here
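
If you need the decision path of every data point rather than just one sample, the per-sample logic above can be wrapped in a small helper. A minimal sketch; the function rules_for_sample is my own wrapper around the arrays defined earlier, not an sklearn API:

def rules_for_sample(sample_id, X, estimator):
    # collect the human-readable rules that a single sample passes through
    feature = estimator.tree_.feature
    threshold = estimator.tree_.threshold
    node_indicator = estimator.decision_path(X)
    leaf_id = estimator.apply(X)

    node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                        node_indicator.indptr[sample_id + 1]]
    rules = []
    for node_id in node_index:
        if node_id == leaf_id[sample_id]:
            rules.append("leaf node %s reached" % node_id)
        elif X[sample_id, feature[node_id]] <= threshold[node_id]:
            rules.append("X[:, %s] <= %s" % (feature[node_id], threshold[node_id]))
        else:
            rules.append("X[:, %s] > %s" % (feature[node_id], threshold[node_id]))
    return rules

# e.g. the rule path of every sample in the test set
all_paths = [rules_for_sample(i, X_test, estimator) for i in range(len(X_test))]

Note that this calls decision_path and apply once per sample; for a large dataset you would compute them once outside the loop and only slice per row.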

