Using scikit to determine contributions of each feature to a specific class prediction


Problem description

I am using a scikit extra trees classifier:

model = ExtraTreesClassifier(n_estimators=10000, n_jobs=-1, random_state=0)

Once the model is fitted and used to predict classes, I would like to find out the contributions of each feature to a specific class prediction. How do I do that in scikit-learn? Is it possible with the extra trees classifier, or do I need to use some other model?
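
For context, the fit/predict step referred to above looks roughly like this (X_train, y_train and X_test are placeholder names, not part of the original question):

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier(n_estimators=10000, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)           # fit on the training data
predictions = model.predict(X_test)   # predict a class for each test sample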

Recommended answer

Update

Being more knowledgeable about ML today than I was 2.5 years ago, I will now say this approach only works for highly linear decision problems. If you carelessly apply it to a non-linear problem, you will have trouble.

Example: Imagine a feature for which neither very large nor very small values predict a class, but values in some intermediate interval do. That could be water intake to predict dehydration. But water intake probably interacts with salt intake, as eating more salt allows for greater water intake. Now you have an interaction between two non-linear features. The decision boundary meanders around your feature space to model this non-linearity, so asking only how much one of the features influences the risk of dehydration simply ignores that interaction. It is not the right question.

Alternative: Another, more meaningful, question you could ask is: if I didn't have this information (if I left out this feature), how much would my prediction of a given label suffer? To do this you simply leave out a feature, train a model, and look at how much precision and recall drop for each of your classes. This still informs you about feature importance, but it makes no assumptions about linearity.
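
A minimal sketch of that leave-one-feature-out approach, assuming a train/test split with placeholder names X_train, X_test, y_train, y_test (not part of the original answer):

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import precision_recall_fscore_support

def leave_one_out_importance(X_train, y_train, X_test, y_test):
    # Per-class precision and recall for a model trained on the given columns only.
    def per_class_scores(columns):
        model = ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=0)
        model.fit(X_train[:, columns], y_train)
        pred = model.predict(X_test[:, columns])
        precision, recall, _, _ = precision_recall_fscore_support(
            y_test, pred, zero_division=0)
        return precision, recall

    all_cols = list(range(X_train.shape[1]))
    base_precision, base_recall = per_class_scores(all_cols)

    drops = {}
    for j in all_cols:
        cols = [c for c in all_cols if c != j]
        precision, recall = per_class_scores(cols)
        # Positive values mean the class's score got worse without feature j.
        drops[j] = {"precision_drop": base_precision - precision,
                    "recall_drop": base_recall - recall}
    return drops

The larger the drop for a class, the more that class's predictions depended on the left-out feature.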

Old answer below.

I worked through a similar problem a while back and posted the same question on Cross Validated. The short answer is that there is no implementation in sklearn that does all of what you want.

However, what you are trying to achieve is really quite simple, and can be done by multiplying the standardised mean value of each feature, split on each class, with the corresponding element of the model.feature_importances_ array. You can write a simple function that standardises your dataset, computes the mean of each feature split across class predictions, and does element-wise multiplication with the model.feature_importances_ array. The greater the absolute resulting values are, the more important the features will be to their predicted class, and, better yet, the sign will tell you whether it is small or large values that are important.

Here's a super simple implementation that takes a data matrix X, a list of predictions Y, and an array of feature importances, and outputs a JSON object describing the importance of each feature to each class.

def class_feature_importance(X, Y, feature_importances):
    # Assumes `scale` from sklearn.preprocessing and numpy as np (imported in the example below).
    N, M = X.shape
    X = scale(X)  # standardise each feature column

    out = {}
    for c in set(Y):
        out[c] = dict(
            # Keys are feature (column) indices, so range(M) rather than range(N).
            zip(range(M), np.mean(X[Y == c, :], axis=0) * feature_importances)
        )

    return out

Example:

import numpy as np
import json
from sklearn.preprocessing import scale

X = np.array([[ 2,  2,  2,  0,  3, -1],
              [ 2,  1,  2, -1,  2,  1],
              [ 0, -3,  0,  1, -2,  0],
              [-1, -1,  1,  1, -1, -1],
              [-1,  0,  0,  2, -3,  1],
              [ 2,  2,  2,  0,  3,  0]], dtype=float)

Y = np.array([0, 0, 1, 1, 1, 0])
feature_importances = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])
# feature_importances = model.feature_importances_

result = class_feature_importance(X, Y, feature_importances)

print(json.dumps(result, indent=4))

{
    "0": {
        "0": 0.097014250014533204, 
        "1": 0.16932975630904751, 
        "2": 0.27854300726557774, 
        "3": -0.17407765595569782, 
        "4": 0.0961523947640823, 
        "5": 0.0
    }, 
    "1": {
        "0": -0.097014250014533177, 
        "1": -0.16932975630904754, 
        "2": -0.27854300726557779, 
        "3": 0.17407765595569782, 
        "4": -0.0961523947640823, 
        "5": 0.0
    }
}

The first level of keys in result are class labels, and the second level of keys are column indices, i.e. feature indices. Recall that large absolute values correspond to importance, and the sign tells you whether it is small (possibly negative) or large values that matter.
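
For reference, with a real fitted model the function above would be called roughly like this (a sketch; note that the sklearn attribute is feature_importances_, with a trailing underscore):

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier(n_estimators=10000, n_jobs=-1, random_state=0)
model.fit(X, Y)
result = class_feature_importance(X, model.predict(X), model.feature_importances_)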
