How to interpret decision trees' graph results and find most informative features?


Problem description


I am using scikit-learn on Python 2.7 and have output some decision tree feature results. I am not sure how to interpret them, though. At first I thought the features were listed from most informative to least informative (top to bottom), but examining the \nvalue entries suggests otherwise. How do I identify the top 5 most informative features from the output, or with a few lines of Python?

from sklearn import tree

# Export the fitted tree (classifierUsed2) in Graphviz .dot format
tree.export_graphviz(classifierUsed2, feature_names=dv.get_feature_names(),
                     out_file=treeFileName)

# Output below
digraph Tree {
node [shape=box] ;
0 [label="avg-length <= 3.5\ngini = 0.0063\nsamples = 250000\nvalue = [249210, 790]"] ;
1 [label="name-entity <= 2.5\ngini = 0.5\nsamples = 678\nvalue = [338, 340]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="first-name=wm <= 0.5\ngini = 0.4537\nsamples = 483\nvalue = [168, 315]"] ;
1 -> 2 ;
3 [label="name-entity <= 1.5\ngini = 0.4016\nsamples = 435\nvalue = [121, 314]"] ;
2 -> 3 ;
4 [label="substring=ee <= 0.5\ngini = 0.4414\nsamples = 73\nvalue = [49, 24]"] ;
3 -> 4 ;
5 [label="substring=oy <= 0.5\ngini = 0.4027\nsamples = 68\nvalue = [49, 19]"] ;
4 -> 5 ;
6 [label="substring=im <= 0.5\ngini = 0.3589\nsamples = 64\nvalue = [49, 15]"] ;
5 -> 6 ;
7 [label="lastLetter-firstName=w <= 0.5\ngini = 0.316\nsamples = 61\nvalue = [49, 12]"] ;
6 -> 7 ;
8 [label="firstLetter-firstName=w <= 0.5\ngini = 0.2815\nsamples = 59\nvalue = [49, 10]"] ;
7 -> 8 ;
9 [label="substring=sa <= 0.5\ngini = 0.2221\nsamples = 55\nvalue = [48, 7]"] ;
... many more lines below

Recommended answer

    In Python you can use DecisionTreeClassifier.feature_importances_, which according to the scikit-learn documentation contains:

    The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [R66].
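    For illustration, here is a minimal, self-contained sketch (the toy data and two-column layout are made up, not from the question) showing where feature_importances_ comes from:

    # Minimal sketch: fit a tiny tree on made-up data and inspect
    # feature_importances_. Only column 0 predicts the label, so it
    # should receive all of the importance.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
    y = np.array([0, 1, 0, 1])  # label equals column 0

    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    print(clf.feature_importances_)  # [1. 0.]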


    Simply do an np.argsort on the feature importances and you get a feature ranking (ties are not accounted for).
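    For example (toy numbers, reusing the array discussed below), reversing the argsort gives indices from the largest importance to the smallest; tied values come out in arbitrary order:

    # Toy illustration of the ranking: argsort sorts ascending,
    # [::-1] reverses it to descending.
    import numpy as np

    importances = np.array([0.0, 0.2, 0.0, 0.1])
    rank = np.argsort(importances)[::-1]
    print(rank)  # e.g. [1 3 2 0]; the tied zeros (indices 0 and 2) in arbitrary order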


    You can look at the Gini impurity (\ngini in the graphviz output) to get a first idea. Lower is better. However, be aware that you will need a way to combine impurity values if a feature is used in more than one split. Typically, this is done by taking the average information gain (or 'purity gain') over all splits on a given feature. This is done for you if you use feature_importances_.
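    As a quick sanity check, for a binary node with class counts value = [a, b] the Gini impurity is 1 - (a/n)^2 - (b/n)^2 with n = a + b; plugging in the root node from the output above reproduces its gini value:

    # Recompute the root node's impurity from value = [249210, 790].
    # Float literals keep this correct under Python 2.7's integer division.
    a, b = 249210.0, 790.0
    n = a + b
    gini = 1 - (a / n) ** 2 - (b / n) ** 2
    print(round(gini, 4))  # 0.0063, matching the graphviz label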


    Edit: I see the problem goes deeper than I thought. The graphviz output is merely a graphical representation of the tree: it shows the tree and every split in detail. It is a representation of the tree, not of the features. Informativeness (or importance) of a feature does not really fit into this representation, because it accumulates information over multiple nodes of the tree.


    The variable classifierUsed2.feature_importances_ contains importance information for every feature. If you get for example [0, 0.2, 0, 0.1, ...] the first feature has an importance of 0, the second feature has an importance of 0.2, the third feature has an importance of 0, the fourth feature an importance of 0.1, and so on.
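    To see which score belongs to which feature, you can pair the array with the names from the vectorizer (assuming dv is a fitted DictVectorizer, as in the question):

    # Print each feature that carries any importance at all.
    for name, importance in zip(dv.get_feature_names(),
                                classifierUsed2.feature_importances_):
        if importance > 0:
            print('%s: %.4f' % (name, importance))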


    Let's sort features by their importance (most important first):

    # Indices of all features, sorted from most to least important
    rank = np.argsort(classifierUsed2.feature_importances_)[::-1]
    


    Now rank contains the indices of the features, starting with the most important one. For the example array above this starts [1, 3, ...]; the zero-importance features follow in arbitrary order.


    Want to see the five most important features?

    print(rank[:5])
    


    This prints the indices. Which index corresponds to which feature? That is something you should know yourself, since you presumably constructed the feature matrix. Chances are that this works:

    print(dv.get_feature_names()[rank[:5]])
    

    Or maybe this:

    print('\n'.join(dv.get_feature_names()[i] for i in rank[:5]))
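
    If the first variant raises a TypeError (a plain Python list cannot be indexed with a NumPy integer array), converting the names to an array first should also work:

    # A NumPy array of the names supports fancy indexing with rank directly.
    names = np.asarray(dv.get_feature_names())
    print(names[rank[:5]])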
    

