scikit learn - feature importance calculation in decision trees


Question


I'm trying to understand how feature importance is calculated for decision trees in scikit-learn. This question has been asked before, but I am unable to reproduce the results the algorithm provides.

For example:

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz  # older releases exposed this as sklearn.tree.export

# Toy dataset: four samples with three binary features.
X = [[1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 0, 1, 1]

clf = DecisionTreeClassifier()
clf.fit(X, y)

# Unnormalized importances: total impurity decrease attributed to each feature.
feat_importance = clf.tree_.compute_feature_importances(normalize=False)
print("feat importance = " + str(feat_importance))

# Write the fitted tree to a Graphviz .dot file for rendering.
export_graphviz(clf, out_file='test/tree.dot')

results in feature importance:

feat importance = [0.25       0.08333333 0.04166667]

and gives the following decision tree (the rendered Graphviz image is not reproduced here; in that tree the root node splits on X[2], the next internal node on X[1], and the last one on X[0], which is the structure the answer's calculations below refer to).
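For readers without the image, a text rendering of the same fitted tree can be produced with export_text, available in newer scikit-learn releases (this is an added sketch, not part of the original question):

from sklearn.tree import export_text

# Print an indented text version of the fitted tree, labelling the columns
# X[0], X[1], X[2] as in the question; each line shows a split threshold.
print(export_text(clf, feature_names=["X[0]", "X[1]", "X[2]"]))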

Now, this answer to a similar question suggests the importance is calculated as

Where G is the node impurity, in this case the gini impurity. This is the impurity reduction as far as I understood it. However, for feature 1 this should be:

This answer suggests the importance is weighted by the probability of reaching the node (which is approximated by the proportion of samples reaching that node). Again, for feature 1 this should be:

Both formulas provide the wrong result. How is the feature importance calculated correctly?

Solution

I think feature importance depends on the implementation, so we need to look at the scikit-learn documentation.

The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance
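In code, clf.feature_importances_ returns this normalized quantity; dividing the unnormalized values from the question by their sum gives the same numbers (a quick sketch of my own, assuming the clf fitted above):

raw = clf.tree_.compute_feature_importances(normalize=False)

print(clf.feature_importances_)  # normalized importances, they sum to 1
print(raw / raw.sum())           # the question's unnormalized values rescaled to sum to 1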

That reduction, or weighted information gain, is defined as:

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
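The per-node impurities and sample counts that enter this equation can be read directly from the fitted tree. Below is a minimal inspection sketch of my own, using the public clf.tree_ arrays and assuming the clf fitted in the question:

tree = clf.tree_

# For every node: the feature it splits on (-2 marks a leaf), the number of
# samples reaching it, and its Gini impurity.
for node in range(tree.node_count):
    print("node", node,
          "feature", tree.feature[node],
          "samples", tree.weighted_n_node_samples[node],
          "gini", round(tree.impurity[node], 3))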

Since each feature is used only once in your case, the importance of each feature is simply the value given by the equation above.

For X[2] :

feature_importance = (4 / 4) * (0.375 - (0.75 * 0.444)) = 0.042

For X[1] :

feature_importance = (3 / 4) * (0.444 - (2/3 * 0.5)) = 0.083

For X[0] :

feature_importance = (2 / 4) * (0.5) = 0.25
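As a cross-check, the quoted equation can be applied to every internal node of the fitted tree and the decreases summed per feature. This is my own illustration of the formula, not scikit-learn's internal code; it assumes the clf fitted in the question and should reproduce the unnormalized importances printed there.

import numpy as np

tree = clf.tree_
N = tree.weighted_n_node_samples[0]        # samples at the root, i.e. the total N
importances = np.zeros(tree.n_features)    # one accumulator per feature

for node in range(tree.node_count):
    left = tree.children_left[node]
    right = tree.children_right[node]
    if left == -1:                         # leaf: no split, no contribution
        continue
    N_t = tree.weighted_n_node_samples[node]
    N_t_L = tree.weighted_n_node_samples[left]
    N_t_R = tree.weighted_n_node_samples[right]
    decrease = (N_t / N) * (tree.impurity[node]
                            - N_t_R / N_t * tree.impurity[right]
                            - N_t_L / N_t * tree.impurity[left])
    importances[tree.feature[node]] += decrease

print(importances)  # should match [0.25, 0.08333333, 0.04166667] for the tree above

The small difference between 0.042 in the hand calculation for X[2] and 0.0417 reported by scikit-learn is only rounding: the exact impurity of the three-sample child node is 4/9 rather than 0.444.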
