Python scikit-learn Linear Model Parameter Standard Error


Problem description


I am working with sklearn, specifically the linear_model module. After fitting a simple linear model, as in

import pandas as pd
import numpy as np
from sklearn import linear_model
randn = np.random.randn

X = pd.DataFrame(randn(10,3), columns=['X1','X2','X3'])
y = pd.DataFrame(randn(10,1), columns=['Y'])        

model = linear_model.LinearRegression()
model.fit(X=X, y=y)

I see how I can access the coefficients and intercept via coef_ and intercept_, and prediction is straightforward as well. I would like to access a variance-covariance matrix for the parameters of this simple model, and the standard errors of these parameters. I am familiar with R and the vcov() function, and it seems that scipy.optimize has some functionality for this (Getting standard errors on fitted parameters using the optimize.leastsq method in python). Does sklearn have any functionality for accessing these statistics?

Appreciate any help on this.

-Ryan

Solution

tl;dr

not with scikit-learn, but you can compute this manually with some linear algebra. i do this for your example below.

also here's a jupyter notebook with this code: https://gist.github.com/grisaitis/cf481034bb413a14d3ea851dab201d31

what and why

the standard errors of your estimates are just the square roots of the variances of your estimates. what's the variance of your estimate? if you assume your model's errors are uncorrelated with constant variance (gaussian error is the usual special case), it's:

Var(beta_hat) = inverse(X.T @ X) * sigma_squared_hat

and then the standard error of beta_hat[i] is Var(beta_hat)[i, i] ** 0.5.

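All that's left is to compute sigma_squared_hat. This is the estimate of the variance of your model's gaussian error. It is not known a priori, but it can be estimated from your residuals: the residual sum of squares divided by N - p, where N is the number of observations and p is the number of parameters including the intercept.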
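
Written out in matrix notation, the quantities above look roughly as follows. This is only a sketch under the usual OLS assumptions (uncorrelated errors with constant variance); X here means the N x p design matrix including the intercept column, and the symbol e for the residual vector is introduced for illustration and does not appear in the original answer.

```latex
\begin{align}
\hat\beta &= (X^\top X)^{-1} X^\top y, & e &= y - X\hat\beta \\
\hat\sigma^2 &= \frac{e^\top e}{N - p} \\
\widehat{\operatorname{Var}}(\hat\beta) &= \hat\sigma^2 \, (X^\top X)^{-1} \\
\operatorname{SE}(\hat\beta_i) &= \sqrt{\widehat{\operatorname{Var}}(\hat\beta)_{ii}}
\end{align}
```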

Also, you need to add an intercept term to your data matrix. Scikit-learn does this automatically with the LinearRegression class (fit_intercept=True is the default). So to compute this yourself, you need to add a column of ones to your X matrix or dataframe.
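
One compact way to add that column is np.column_stack. The snippet below is a minimal sketch on synthetic data; the seed and the stand-in X array are assumptions made for reproducibility and are not the data from the question.

```python
import numpy as np

rng = np.random.default_rng(0)      # seed chosen only for reproducibility
X = rng.standard_normal((10, 3))    # stand-in for the X data in the question

# prepend a column of ones so the first coefficient acts as the intercept
X_with_intercept = np.column_stack([np.ones(len(X)), X])

print(X_with_intercept.shape)
```

The result has one more column than X, and the remaining columns are the original features in their original order.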

how

Starting after your code,

show your scikit-learn results

print(model.intercept_)
print(model.coef_)

[-0.28671532]
[[ 0.17501115 -0.6928708   0.22336584]]

reproduce this with linear algebra

N = len(X)
p = len(X.columns) + 1  # plus one because LinearRegression adds an intercept term

# np.float was removed in NumPy 1.24; the builtin float is equivalent
X_with_intercept = np.empty(shape=(N, p), dtype=float)
X_with_intercept[:, 0] = 1  # intercept column
X_with_intercept[:, 1:p] = X.values

# ordinary least squares via the normal equations
beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y.values
print(beta_hat)

[[-0.28671532]
 [ 0.17501115]
 [-0.6928708 ]
 [ 0.22336584]]
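
Explicitly inverting X.T @ X is fine for a small, well-conditioned problem like this one, but it can lose precision when columns are nearly collinear. As a hedged alternative (not part of the original answer), the sketch below checks that np.linalg.lstsq gives the same coefficients on synthetic data; the seed and the data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, not the data from the question
N = 10
X_with_intercept = np.column_stack([np.ones(N), rng.standard_normal((N, 3))])
y = rng.standard_normal((N, 1))

# normal equations, as in the answer
beta_inv = np.linalg.inv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y

# least-squares solver; avoids forming the inverse explicitly
beta_lstsq, _, rank, _ = np.linalg.lstsq(X_with_intercept, y, rcond=None)

print(np.allclose(beta_inv, beta_lstsq))
print(rank)  # should equal the number of columns if X has full column rank
```

The inverse of X.T @ X is still needed for the covariance matrix itself, so this mainly helps when only the coefficients are of interest.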

compute standard errors of the parameter estimates

y_hat = model.predict(X)
residuals = y.values - y_hat
residual_sum_of_squares = residuals.T @ residuals
sigma_squared_hat = residual_sum_of_squares[0, 0] / (N - p)
var_beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) * sigma_squared_hat
for p_ in range(p):
    standard_error = var_beta_hat[p_, p_] ** 0.5
    print(f"SE(beta_hat[{p_}]): {standard_error}")

SE(beta_hat[0]): 0.2468580488280805
SE(beta_hat[1]): 0.2965501221823944
SE(beta_hat[2]): 0.3518847753610169
SE(beta_hat[3]): 0.3250760291745124
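
Since the question also asks for the full variance-covariance matrix (the analogue of R's vcov()), the steps above can be gathered into one function. This is a sketch only: ols_covariance is a hypothetical helper name, not something scikit-learn provides, and the data below is synthetic.

```python
import numpy as np

def ols_covariance(X, y):
    """Return (beta_hat, cov_beta_hat, standard_errors) for OLS with an intercept.

    Hypothetical helper. X has shape (N, k) without an intercept column,
    y has shape (N,) or (N, 1).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    N = X.shape[0]
    X1 = np.column_stack([np.ones(N), X])            # add intercept column
    p = X1.shape[1]
    XtX_inv = np.linalg.inv(X1.T @ X1)
    beta_hat = XtX_inv @ X1.T @ y
    residuals = y - X1 @ beta_hat
    sigma_squared_hat = residuals @ residuals / (N - p)  # unbiased error variance estimate
    cov = XtX_inv * sigma_squared_hat                    # variance-covariance matrix
    return beta_hat, cov, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)   # synthetic data for illustration
X = rng.standard_normal((10, 3))
y = rng.standard_normal(10)
beta_hat, cov, se = ols_covariance(X, y)
print(cov.shape)
print(se)
```

The off-diagonal entries of cov are the covariances between pairs of coefficient estimates; the standard errors are the square roots of its diagonal.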

confirm with statsmodels

import statsmodels.api as sm
ols = sm.OLS(y.values, X_with_intercept)
ols_result = ols.fit()
print(ols_result.summary())  # print() is needed outside a notebook

...
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2867      0.247     -1.161      0.290      -0.891       0.317
x1             0.1750      0.297      0.590      0.577      -0.551       0.901
x2            -0.6929      0.352     -1.969      0.096      -1.554       0.168
x3             0.2234      0.325      0.687      0.518      -0.572       1.019
==============================================================================

yay, done!
