Python scikit learn Linear Model Parameter Standard Error
I am working with sklearn and specifically the linear_model module. After fitting a simple linear model as in
import pandas as pd
import numpy as np
from sklearn import linear_model
randn = np.random.randn
X = pd.DataFrame(randn(10,3), columns=['X1','X2','X3'])
y = pd.DataFrame(randn(10,1), columns=['Y'])
model = linear_model.LinearRegression()
model.fit(X=X, y=y)
I see how to access the coefficients and intercept via coef_ and intercept_, and prediction is straightforward as well. I would like to access the variance-covariance matrix for the parameters of this simple model, and the standard errors of these parameters. I am familiar with R and the vcov() function, and it seems that scipy.optimize has some functionality for this (Getting standard errors on fitted parameters using the optimize.leastsq method in python). Does sklearn have any functionality for accessing these statistics?
Appreciate any help on this.
-Ryan
tl;dr
Not with scikit-learn, but you can compute this manually with some linear algebra. I do this for your example below.
Also, here's a Jupyter notebook with this code: https://gist.github.com/grisaitis/cf481034bb413a14d3ea851dab201d31
what and why
The standard errors of your estimates are just the square roots of the variances of your estimates. What's the variance of your estimate? If you assume your model has gaussian error, it's:
Var(beta_hat) = inverse(X.T @ X) * sigma_squared_hat
and then the standard error of beta_hat[i] is Var(beta_hat)[i, i] ** 0.5.
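All you have to compute is sigma_squared_hat. This is the estimate of the variance of your model's gaussian error. It is not known a priori, but it can be estimated from your residuals: the residual sum of squares divided by N - p, where N is the number of observations and p is the number of parameters including the intercept.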
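For reference, here is the same thing in matrix notation. This is only a restatement of the formulas above, assuming X is the N x p data matrix with the intercept column already included and y is the vector of responses:

```latex
\widehat{\operatorname{Var}}(\hat\beta) = \hat\sigma^2 \, (X^\top X)^{-1},
\qquad
\hat\sigma^2 = \frac{(y - X\hat\beta)^\top (y - X\hat\beta)}{N - p},
\qquad
\operatorname{SE}(\hat\beta_i) = \sqrt{\widehat{\operatorname{Var}}(\hat\beta)_{ii}}
```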
Also you need to add an intercept term to your data matrix. Scikit-learn does this automatically with the LinearRegression class. So to compute this yourself you need to add that to your X matrix or dataframe.
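As a minimal sketch of what "adding an intercept term" means, one option is to prepend a column of ones with NumPy. The array X_toy below is made-up illustration data, not the data from the question:

```python
import numpy as np

# Hypothetical toy data: 5 observations, 2 features (not from the question).
X_toy = np.arange(10, dtype=float).reshape(5, 2)

# Prepend a column of ones so the first coefficient acts as the intercept.
X_toy_with_intercept = np.column_stack([np.ones(len(X_toy)), X_toy])

print(X_toy_with_intercept.shape)
```

The walkthrough below builds the same kind of matrix by filling an empty array instead; either approach should give the same result.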
how
Starting after your code,
show your scikit-learn results
print(model.intercept_)
print(model.coef_)
[-0.28671532]
[[ 0.17501115 -0.6928708 0.22336584]]
reproduce this with linear algebra
N = len(X)
p = len(X.columns) + 1 # plus one because LinearRegression adds an intercept term
X_with_intercept = np.empty(shape=(N, p), dtype=float)  # np.float was removed in NumPy 1.24; use the builtin float
X_with_intercept[:, 0] = 1
X_with_intercept[:, 1:p] = X.values
beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y.values
print(beta_hat)
[[-0.28671532]
[ 0.17501115]
[-0.6928708 ]
[ 0.22336584]]
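A side note on the coefficient step: forming the explicit inverse is fine for a small, well-conditioned example like this one, but a least-squares solver is usually the more numerically robust choice. The sketch below uses synthetic data with an arbitrary seed (X_demo and y_demo are illustration names, not the question's variables) and checks that np.linalg.lstsq agrees with the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
X_demo = np.column_stack([np.ones(10), rng.standard_normal((10, 3))])
y_demo = rng.standard_normal((10, 1))

# Normal equations, as in the walkthrough above.
beta_normal = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Least-squares solver; avoids forming an explicit inverse.
beta_lstsq, _, rank, _ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))
```

The inverse of X.T @ X is still needed for the covariance matrix, so this only changes how beta_hat itself is obtained.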
compute standard errors of the parameter estimates
y_hat = model.predict(X)
residuals = y.values - y_hat
residual_sum_of_squares = residuals.T @ residuals
sigma_squared_hat = residual_sum_of_squares[0, 0] / (N - p)
var_beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) * sigma_squared_hat
for p_ in range(p):
    standard_error = var_beta_hat[p_, p_] ** 0.5
    print(f"SE(beta_hat[{p_}]): {standard_error}")
SE(beta_hat[0]): 0.2468580488280805
SE(beta_hat[1]): 0.2965501221823944
SE(beta_hat[2]): 0.3518847753610169
SE(beta_hat[3]): 0.3250760291745124
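Since the question also asks for the full variance-covariance matrix (what vcov() gives you in R), the steps above can be bundled into one function. This is a sketch only: ols_vcov is a hypothetical helper name, not something scikit-learn provides, and it assumes X is passed without an intercept column. The tiny dataset at the end is made up so the result can be checked by hand.

```python
import numpy as np

def ols_vcov(X, y):
    """Return (beta_hat, vcov, standard_errors) for OLS with an intercept.

    Hypothetical helper, not part of scikit-learn. X is (N, k) without an
    intercept column; y is (N,) or (N, 1).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)
    N = X.shape[0]
    X1 = np.column_stack([np.ones(N), X])  # add the intercept column
    p = X1.shape[1]                        # parameters, including intercept
    xtx_inv = np.linalg.inv(X1.T @ X1)
    beta_hat = xtx_inv @ X1.T @ y
    residuals = y - X1 @ beta_hat
    sigma_squared_hat = residuals @ residuals / (N - p)
    vcov = xtx_inv * sigma_squared_hat     # variance-covariance matrix
    return beta_hat, vcov, np.sqrt(np.diag(vcov))

# Tiny hand-checkable example: x = 0, 1, 2 and y = 0, 1, 1.
beta_hat, vcov, se = ols_vcov(np.array([[0.0], [1.0], [2.0]]),
                              np.array([0.0, 1.0, 1.0]))
print(beta_hat)
print(se)
```

For this toy data the fit works out by hand to an intercept of 1/6 and a slope of 1/2, with standard errors sqrt(5)/6 and sqrt(1/12). The off-diagonal entries of vcov are the covariances between parameter estimates.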
confirm with statsmodels
import statsmodels.api as sm
ols = sm.OLS(y.values, X_with_intercept)
ols_result = ols.fit()
ols_result.summary()
...
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2867      0.247     -1.161      0.290      -0.891       0.317
x1             0.1750      0.297      0.590      0.577      -0.551       0.901
x2            -0.6929      0.352     -1.969      0.096      -1.554       0.168
x3             0.2234      0.325      0.687      0.518      -0.572       1.019
==============================================================================
yay, done!