Python scikit-learn Linear Model Parameter Standard Error


Problem description


I am working with sklearn, specifically the linear_model module. After fitting a simple linear model as in

import pandas as pd
import numpy as np
from sklearn import linear_model
randn = np.random.randn

X = pd.DataFrame(randn(10,3), columns=['X1','X2','X3'])
y = pd.DataFrame(randn(10,1), columns=['Y'])        

model = linear_model.LinearRegression()
model.fit(X=X, y=y)

I see how to access the coefficients and intercept via coef_ and intercept_, and prediction is straightforward as well. I would like to access the variance-covariance matrix for the parameters of this simple model, and the standard errors of these parameters. I am familiar with R and the vcov() function, and it seems that scipy.optimize has some functionality for this (Getting standard errors on fitted parameters using the optimize.leastsq method in python). Does sklearn have any functionality for accessing these statistics?

Appreciate any help on this.

-Ryan

Solution

tl;dr

not with scikit-learn, but you can compute this manually with some linear algebra. i do this for your example below.

also here's a jupyter notebook with this code: https://gist.github.com/grisaitis/cf481034bb413a14d3ea851dab201d31

what and why

the standard errors of your estimates are just the square roots of the variances of your estimates. what's the variance of your estimate? if you assume your model has gaussian error (more generally, uncorrelated errors with constant variance), it's:

Var(beta_hat) = inverse(X.T @ X) * sigma_squared_hat

and then the standard error of beta_hat[i] is Var(beta_hat)[i, i] ** 0.5.

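All you have to compute is sigma_squared_hat. This is the estimate of the variance of your model's gaussian error. It is not known a priori but can be estimated from your residuals: the residual sum of squares divided by N - p, where N is the number of observations and p is the number of parameters including the intercept.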
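Written out in standard OLS notation, the quantities above are as follows. This is only a restatement of the text; N and p are the number of rows and columns of X (intercept column included), matching the variable names used in the code further down.

```latex
\begin{aligned}
\hat\beta &= (X^\top X)^{-1} X^\top y \\
\hat\sigma^2 &= \frac{(y - X\hat\beta)^\top (y - X\hat\beta)}{N - p} \\
\widehat{\mathrm{Var}}(\hat\beta) &= \hat\sigma^2 \, (X^\top X)^{-1} \\
\mathrm{SE}(\hat\beta_i) &= \sqrt{\widehat{\mathrm{Var}}(\hat\beta)_{ii}}
\end{aligned}
```

Dividing by N - p rather than N is what makes the variance estimate unbiased.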

Also, you need to add an intercept term to your data matrix. Scikit-learn does this automatically with the LinearRegression class (fit_intercept=True is the default). So to compute this yourself, you need to add a column of ones to your X matrix or dataframe.
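One compact way to add the intercept column is np.column_stack. This is a minimal sketch with made-up data; X_raw and X_design are illustrative names, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed, only for reproducibility
X_raw = rng.standard_normal((10, 3))  # stand-in for X.values from the question

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(len(X_raw)), X_raw])

print(X_design.shape)
```

The remaining columns are the original features, unchanged and in the same order.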

how

Starting after your code,

show your scikit-learn results

print(model.intercept_)
print(model.coef_)

[-0.28671532]
[[ 0.17501115 -0.6928708   0.22336584]]

reproduce this with linear algebra

N = len(X)
p = len(X.columns) + 1  # plus one because LinearRegression adds an intercept term

X_with_intercept = np.empty(shape=(N, p), dtype=float)  # np.float was removed in NumPy 1.24
X_with_intercept[:, 0] = 1
X_with_intercept[:, 1:p] = X.values

beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y.values
print(beta_hat)

[[-0.28671532]
 [ 0.17501115]
 [-0.6928708 ]
 [ 0.22336584]]
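As a side note, the explicit inverse is fine for a small, well-conditioned problem like this one, but np.linalg.lstsq solves the same least-squares problem without forming the inverse. The sketch below uses synthetic data (not the numbers above) and illustrative names to check that the two routes agree.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
N = 10
X_design = np.column_stack([np.ones(N), rng.standard_normal((N, 3))])
y = rng.standard_normal((N, 1))

# Normal-equations solution, as in the answer.
beta_inv = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y

# Same least-squares problem solved by NumPy's SVD-based solver.
beta_lstsq, _, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)

print(np.allclose(beta_inv, beta_lstsq))
```

The inverse of X.T @ X is still needed for the covariance matrix, so this only changes how beta_hat itself is obtained.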

compute standard errors of the parameter estimates

y_hat = model.predict(X)
residuals = y.values - y_hat
residual_sum_of_squares = residuals.T @ residuals
sigma_squared_hat = residual_sum_of_squares[0, 0] / (N - p)
var_beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) * sigma_squared_hat
for p_ in range(p):
    standard_error = var_beta_hat[p_, p_] ** 0.5
    print(f"SE(beta_hat[{p_}]): {standard_error}")

SE(beta_hat[0]): 0.2468580488280805
SE(beta_hat[1]): 0.2965501221823944
SE(beta_hat[2]): 0.3518847753610169
SE(beta_hat[3]): 0.3250760291745124
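The loop above reads the diagonal one entry at a time. If you want the whole variance-covariance matrix, roughly what vcov() gives you in R, the same steps can be wrapped in a small function. ols_covariance is a hypothetical helper name, not part of scikit-learn or NumPy, and the three-point dataset is made up so the result is easy to check by hand.

```python
import numpy as np

def ols_covariance(X, y):
    """Hypothetical helper: OLS fit with intercept, returning
    (beta_hat, covariance matrix, standard errors)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)
    N = X.shape[0]
    X_design = np.column_stack([np.ones(N), X])  # add intercept column
    p = X_design.shape[1]

    xtx_inv = np.linalg.inv(X_design.T @ X_design)
    beta_hat = xtx_inv @ X_design.T @ y
    residuals = y - X_design @ beta_hat
    sigma_squared_hat = residuals @ residuals / (N - p)

    cov = xtx_inv * sigma_squared_hat  # full variance-covariance matrix
    se = np.sqrt(np.diag(cov))         # standard errors are the root of the diagonal
    return beta_hat, cov, se

x = np.array([[-1.0], [0.0], [1.0]])
y = np.array([0.0, 0.0, 3.0])
beta_hat, cov, se = ols_covariance(x, y)
print(beta_hat)
print(se)
```

For this toy data, X.T @ X is diagonal, so the covariance matrix is diagonal too, and the off-diagonal entries would otherwise hold the covariances between parameter estimates.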

confirm with statsmodels

import statsmodels.api as sm
ols = sm.OLS(y.values, X_with_intercept)
ols_result = ols.fit()
print(ols_result.summary())  # print() so the table also shows outside a notebook

...
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2867      0.247     -1.161      0.290      -0.891       0.317
x1             0.1750      0.297      0.590      0.577      -0.551       0.901
x2            -0.6929      0.352     -1.969      0.096      -1.554       0.168
x3             0.2234      0.325      0.687      0.518      -0.572       1.019
==============================================================================

yay, done!
