How to get feature importance in logistic regression using weights?

Problem Description

I have a dataset of reviews with a positive/negative class label, and I am applying logistic regression to it. First, I convert the reviews into a bag-of-words representation. Here sorted_data['Text'] contains the reviews and final_counts is the resulting sparse matrix:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(sorted_data['Text'].values)
standardized_data = StandardScaler(with_mean=False).fit_transform(final_counts)  # with_mean=False keeps the matrix sparse

Split the dataset into train and test sets:

from sklearn.model_selection import train_test_split

X_1, X_test, y_1, y_test = train_test_split(final_counts, labels, test_size=0.3, random_state=0)
X_tr, X_cv, y_tr, y_cv = train_test_split(X_1, y_1, test_size=0.3)

I am applying the logistic regression algorithm as follows:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# note: sklearn's C is the inverse of the regularization strength
optimal_lambda = 0.001
log_reg_optimal = LogisticRegression(C=optimal_lambda)

# fit the model
log_reg_optimal.fit(X_tr, y_tr)

# predict the response
pred = log_reg_optimal.predict(X_test)

# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the Logistic Regression for C = %f is %f%%' % (optimal_lambda, acc))

The weights are:

weights = log_reg_optimal.coef_   # <class 'numpy.ndarray'>

array([[-0.23729528, -0.16050616, -0.1382504 , ...,  0.27291847,
         0.35857267,  0.41756443]])
(1, 38178)   # shape of weights

I want to get the feature importance, i.e. the top 100 features that have the highest weights. Could anyone tell me how to get them?

Recommended Answer

One way to investigate the "influence" or "importance" of a given feature/parameter in a linear classification model is to consider the magnitude of its coefficient.

This is the most basic approach. Other techniques for assessing feature importance or parameter influence can provide more insight, such as p-values, bootstrap scores, various "discriminative indices", etc.
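
As an illustration of the bootstrap idea, here is a minimal sketch (not from the original answer; it reuses the X_tr/y_tr split and the C value from the question): refit the model on resampled training data and check how stable each coefficient is.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

n_boot = 50
boot_coefs = []
for i in range(n_boot):
    # sample the training set with replacement
    X_b, y_b = resample(X_tr, y_tr, random_state=i)
    boot_coefs.append(LogisticRegression(C=0.001).fit(X_b, y_b).coef_.ravel())

boot_coefs = np.array(boot_coefs)   # shape: (n_boot, n_features)
coef_std = boot_coefs.std(axis=0)   # a large std means an unstable "importance"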

Here, since you have already standardized the data, you can use the coefficient magnitudes directly:

import numpy as np

weights = log_reg_optimal.coef_
abs_weights = np.abs(weights)

print(abs_weights)

If you look at the original (signed) weights, a negative coefficient means that a higher value of the corresponding feature pushes the classification towards the negative class.
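
To make the sign interpretation concrete, here is a tiny illustrative sketch; the feature names and weights below are made up, not taken from your model:

import numpy as np

feature_names = np.array(['terrible', 'good', 'bad', 'great'])   # hypothetical
weights = np.array([-1.2, 0.7, -0.9, 1.1])                       # hypothetical

print(feature_names[weights > 0])   # features pushing towards the positive class
print(feature_names[weights < 0])   # features pushing towards the negative class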

Edit 1

An example showing how to get the feature names:

import numpy as np

# feature names
names_of_variables = np.array(['a', 'b', 'c', 'd'])

# create random weights and get the magnitude
weights = np.random.rand(4)
abs_weights = np.abs(weights)

# get the sorting indices (descending by magnitude)
sorted_index = np.argsort(abs_weights)[::-1]

# check that the sorting indices are correct
print(abs_weights[sorted_index])

# get the indices of the top-2 features
top_2 = sorted_index[:2]

# get the names of the top 2 most important features
print(names_of_variables[top_2])
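
Applied to your own model, the same argsort idea yields the requested top 100 features. This is a minimal sketch, assuming a scikit-learn version where CountVectorizer exposes get_feature_names_out() (older releases use get_feature_names() instead):

import numpy as np

feature_names = np.array(count_vect.get_feature_names_out())
weights = log_reg_optimal.coef_.ravel()

# indices of the 100 largest-magnitude coefficients, most important first
top_100 = np.argsort(np.abs(weights))[::-1][:100]

for name, w in zip(feature_names[top_100], weights[top_100]):
    print('%s: %.4f' % (name, w))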
