How to find the importance of the features for a logistic regression model?
Question
I have a binary prediction model trained with the logistic regression algorithm. I want to know which features (predictors) are more important for the decision between the positive and negative class. I know there is a coef_
attribute in the scikit-learn package, but I don't know whether it is sufficient as a measure of importance. Another thing is how I can evaluate the coef_
values in terms of importance for the negative and positive classes. I have also read about standardized regression coefficients, but I don't know what they are.
Let's say there are features like the size of the tumor, the weight of the tumor, etc., used to decide whether a test case is malignant or not. I want to know which of the features are more important for the malignant / not-malignant prediction. Does that make sense?
Answer
One of the simplest options to get a feeling for the "influence" of a given parameter in a linear classification model (logistic being one of those), is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.
Consider this example:
import numpy as np
from sklearn.linear_model import LogisticRegression
x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0  # per-sample noise
X = np.column_stack([x1, x2, x3])
m = LogisticRegression()
m.fit(X, y)
# The estimated coefficients will all be around 1:
print(m.coef_)
# Those values, however, will show that the second parameter
# is more influential
print(np.std(X, 0)*m.coef_)
An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:
m.fit(X / np.std(X, 0), y)
print(m.coef_)
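On the question of positive vs. negative classes: in scikit-learn's convention, a positive coefficient pushes the decision toward the positive class and a negative one toward the negative class, so the sign of each coef_ entry tells you the direction of the feature's influence. A minimal sketch on synthetic data (the variable names and data-generating process here are illustrative, not from the original question):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
x1 = rng.randn(500)  # drives y toward the positive class
x2 = rng.randn(500)  # drives y toward the negative class
y = (x1 - x2 + 0.1 * rng.randn(500)) > 0
X = np.column_stack([x1, x2])

m = LogisticRegression().fit(X, y)
# The fitted coefficient for x1 comes out positive and for x2 negative,
# mirroring how each feature was used to generate the labels.
print(m.coef_)
```

Combined with the magnitude-times-standard-deviation idea above, this gives both the direction and a rough strength of each feature's influence.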
Note that this is the most basic approach, and a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various "discriminative indices", etc.).
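One such model-agnostic alternative available in scikit-learn itself is permutation importance: shuffle one column at a time and measure how much the model's score degrades. A sketch reusing the synthetic data from the example above (sklearn.inspection.permutation_importance requires scikit-learn >= 0.22):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
x1 = rng.randn(100)
x2 = 4 * rng.randn(100)
x3 = 0.5 * rng.randn(100)
y = (3 + x1 + x2 + x3 + 0.2 * rng.randn(100)) > 0
X = np.column_stack([x1, x2, x3])

m = LogisticRegression().fit(X, y)
# Shuffle each feature column in turn and record the mean drop in accuracy
r = permutation_importance(m, X, y, n_repeats=30, random_state=0)
print(r.importances_mean)  # x2, having the largest spread, typically dominates here
```

Unlike raw coefficients, this works for any fitted estimator and needs no standardization of the inputs.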
I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.