scikit-learn: how to check coefficient significance
Question
I tried to fit a logistic regression (LR) with scikit-learn on a rather large dataset with about 600 dummy variables and only a few interval variables (and 300K rows), and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and run an ANOVA, but I cannot find how to access either. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!
Recommended answer
Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficient significance tests (and much more), you can use the Logit estimator from Statsmodels. This package mimics the interface of glm models in R, so you may find it familiar.
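For data with many dummy variables, the Statsmodels formula interface can build the dummy columns itself. The following is only a minimal sketch on synthetic data, not the asker's dataset: the column names y, x, and group, and the simulated effect sizes, are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): one numeric feature and one 3-level categorical.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "x": rng.normal(size=n),
    "group": rng.choice(["a", "b", "c"], size=n),
})

# Simulate a binary outcome from a known logistic model.
effect = df["group"].map({"a": -0.5, "b": 0.0, "c": 0.5})
prob = 1.0 / (1.0 + np.exp(-(0.8 * df["x"] + effect)))
df["y"] = (rng.uniform(size=n) < prob).astype(int)

# C(group) expands the categorical column into dummy variables,
# dropping one reference level so the design matrix stays full rank.
result = smf.logit("y ~ x + C(group)", data=df).fit(disp=0)

print(result.params)   # coefficient estimates
print(result.bse)      # standard errors
print(result.pvalues)  # two-sided p-values from the Wald z-test
```

Dropping a reference level matters here: a full set of dummies plus an intercept is perfectly collinear, and the standard errors are then not defined.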
If you still want to stick with scikit-learn's LogisticRegression, you can use the asymptotic approximation to the distribution of maximum likelihood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the Hessian matrix of the negative log-likelihood at theta. For logistic regression, H = X' W X, where X is the design matrix including a column of ones and W is a diagonal matrix with entries p * (1 - p). This is exactly what the function below does:
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def logit_pvalue(model, x):
    """Calculate two-sided p-values for scikit-learn LogisticRegression.

    parameters:
        model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
        x: matrix on which the model was fit
    This function uses asymptotics for maximum likelihood estimates.
    """
    p = model.predict_proba(x)
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    # Prepend a column of ones so the intercept is handled like any other coefficient.
    x_full = np.insert(np.asarray(x, dtype=float), 0, 1, axis=1)
    # Information matrix X' W X with W = diag(p1 * p0), computed without a Python loop.
    weights = p[:, 1] * p[:, 0]
    info = x_full.T @ (x_full * weights[:, np.newaxis])
    vcov = np.linalg.inv(info)
    se = np.sqrt(np.diag(vcov))
    z = coefs / se
    return (1 - norm.cdf(np.abs(z))) * 2

# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0,0,0,1,0,0,1,1,1,1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))
# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
sm_model.summary()
The outputs of the two print() calls agree to seven decimal places, and they are the coefficient p-values (intercept first, then the slope):
[ 0.11413093 0.08779978]
[ 0.11413093 0.08779979]
sm_model.summary() also produces a nicely formatted summary table, which renders as HTML in a notebook.
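The same variance-covariance matrix also gives standard errors and Wald confidence intervals, not just p-values. The sketch below is one way to collect them. The function name logit_wald_table and its level parameter are hypothetical, not part of scikit-learn, and it assumes a binary model fit with an intercept and effectively no regularization (large C), as in the example above.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def logit_wald_table(model, x, level=0.95):
    """Wald statistics for a fitted binary LogisticRegression (hypothetical helper).

    Assumes the model has an intercept and was fit with negligible
    regularization, so its coefficients approximate the MLE.
    """
    x_full = np.insert(np.asarray(x, dtype=float), 0, 1.0, axis=1)
    p = model.predict_proba(x)
    weights = p[:, 0] * p[:, 1]                      # p * (1 - p) per row
    info = x_full.T @ (x_full * weights[:, np.newaxis])
    vcov = np.linalg.inv(info)                       # fails if columns are collinear
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    se = np.sqrt(np.diag(vcov))
    z = coefs / se
    half_width = norm.ppf(0.5 + level / 2) * se      # e.g. about 1.96 * se for 95%
    return {
        "coef": coefs,
        "se": se,
        "z": z,
        "pvalue": 2 * norm.sf(np.abs(z)),            # two-sided p-value
        "ci_lower": coefs - half_width,
        "ci_upper": coefs + half_width,
    }

# Same toy data as in the example above.
x = np.arange(10)[:, np.newaxis]
y = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])
model = LogisticRegression(C=1e30).fit(x, y)
table = logit_wald_table(model, x)
for name, values in table.items():
    print(name, values)
```

With hundreds of dummy variables the matrix inversion is the step most likely to fail: if the dummies are collinear (for example, every level of a category is encoded alongside an intercept), the information matrix is singular and np.linalg.inv raises an error.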