scikit-learn: how to check the significance of coefficients

Published: 2020/5/4 3:16:36  Tags: scikit-learn logistic-regression anova dummy-data

Question

I tried to fit a logistic regression (LR) with scikit-learn on a rather large dataset with ~600 dummy variables and only a few interval variables (and 300K rows), and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and run an ANOVA, but I cannot find how to access them. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!

Recommended answer

Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficient significance tests (and much more), you can use the Logit estimator from Statsmodels. This package mimics the interface of glm models in R, so you may find it familiar.

If you still want to stick with scikit-learn's LogisticRegression, you can use the asymptotic approximation to the distribution of maximum likelihood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the Hessian matrix of the negative log-likelihood at theta. This is exactly what the function below does:

import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def logit_pvalue(model, x):
    """ Calculate two-sided p-values for scikit-learn LogisticRegression coefficients.
    parameters:
        model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
        x:     matrix on which the model was fit
    returns:
        array of p-values, intercept first, then one per coefficient
    This function uses asymptotics for maximum likelihood estimates.
    """
    proba = model.predict_proba(x)
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    # prepend a column of ones so the intercept is treated like any other coefficient
    x_full = np.insert(np.asarray(x, dtype=float), 0, 1, axis=1)
    # H = sum_i p_i * (1 - p_i) * x_i x_i^T, computed without an explicit loop
    weights = proba[:, 1] * proba[:, 0]
    hessian = x_full.T @ (x_full * weights[:, np.newaxis])
    vcov = np.linalg.inv(hessian)
    se = np.sqrt(np.diag(vcov))
    z = coefs / se
    return (1 - norm.cdf(np.abs(z))) * 2
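
For readers who prefer symbols, the quantities the function above computes can be written out as follows. The notation is an assumption introduced here for illustration: \tilde{x}_i is row i of x with a leading 1 for the intercept, \hat{p}_i is the fitted probability of class 1, and H is taken as the Hessian of the negative log-likelihood (so that it is positive definite and can be inverted directly).

```latex
\begin{aligned}
H &= \sum_{i=1}^{n} \hat{p}_i \,(1 - \hat{p}_i)\, \tilde{x}_i \tilde{x}_i^{\top}, \\
\widehat{\operatorname{Var}}(\hat{\theta}) &= H^{-1}, \qquad
\operatorname{se}_j = \sqrt{\left(H^{-1}\right)_{jj}}, \\
z_j &= \frac{\hat{\theta}_j}{\operatorname{se}_j}, \qquad
p_j = 2\,\bigl(1 - \Phi(|z_j|)\bigr),
\end{aligned}
```

Here \Phi is the standard normal CDF, which corresponds to norm.cdf in the code. The approximation only describes the maximum likelihood estimate, which is why the model must be fitted with a very large C (effectively no regularization).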

# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0,0,0,1,0,0,1,1,1,1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))

# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
sm_model.summary()

The outputs of the two print() calls agree up to rounding, and they are the coefficient p-values (intercept first):

[ 0.11413093  0.08779978]
[ 0.11413093  0.08779979]

In a Jupyter notebook, sm_model.summary() also displays a nicely formatted HTML summary.
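
To see where inverse(H) comes from without relying on either library, here is a minimal sketch that fits the same toy data by Newton's method using only NumPy and the standard library. The function name newton_logit and its parameters are hypothetical and not part of scikit-learn or Statsmodels; the normal CDF is replaced by math.erfc, using the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)). Under these assumptions it should give p-values close to the ones printed above (roughly 0.114 and 0.088).

```python
import math

import numpy as np


def newton_logit(x, y, n_iter=50, tol=1e-10):
    """Hypothetical helper: unpenalized logistic regression via Newton's method.

    Returns (theta, vcov), with the intercept first in theta and
    vcov = inverse(H), H being the Hessian of the negative log-likelihood.
    """
    X = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    y = np.asarray(y, dtype=float)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - p)                          # gradient of the log-likelihood
        H = X.T @ (X * (p * (1 - p))[:, np.newaxis])  # Hessian of the negative log-likelihood
        step = np.linalg.solve(H, grad)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    # re-evaluate H at the final estimate before inverting it
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    H = X.T @ (X * (p * (1 - p))[:, np.newaxis])
    return theta, np.linalg.inv(H)


# same toy data as in the test above
x = np.arange(10)[:, np.newaxis]
y = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])

theta, vcov = newton_logit(x, y)
z = theta / np.sqrt(np.diag(vcov))
# two-sided p-values from the standard normal distribution
pvalues = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
print(pvalues)
```

The matrix H does double duty here: it drives each Newton update and, once the iteration has converged, its inverse is the estimated variance-covariance matrix. With ~600 dummy variables H can be close to singular (for example when all levels of a categorical variable are encoded alongside an intercept), in which case the inversion fails or yields meaningless standard errors.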
