Python scikit learn Linear Model Parameter Standard Error
Problem description
I am working with sklearn and specifically the linear_model module. After fitting a simple linear model as in
import pandas as pd
import numpy as np
from sklearn import linear_model
randn = np.random.randn
X = pd.DataFrame(randn(10,3), columns=['X1','X2','X3'])
y = pd.DataFrame(randn(10,1), columns=['Y'])
model = linear_model.LinearRegression()
model.fit(X=X, y=y)
I see how I can access the coefficients and intercept via coef_ and intercept_, and prediction is straightforward as well. I would like to access a variance-covariance matrix for the parameters of this simple model, and the standard errors of these parameters. I am familiar with R and the vcov() function, and it seems that scipy.optimize has some functionality for this (Getting standard errors on fitted parameters using the optimize.leastsq method in python). Does sklearn have any functionality for accessing these statistics?
Appreciate any help on this.
-Ryan
tl;dr
Not with scikit-learn, but you can compute this manually with some linear algebra. I do this for your example below.
also here's a jupyter notebook with this code: https://gist.github.com/grisaitis/cf481034bb413a14d3ea851dab201d31
what and why
The standard errors of your estimates are just the square roots of the variances of your estimates. What's the variance of your estimate? If you assume your model has Gaussian error, it's:
Var(beta_hat) = inverse(X.T @ X) * sigma_squared_hat
and then the standard error of beta_hat[i] is Var(beta_hat)[i, i] ** 0.5.
All you have to compute is sigma_squared_hat. This is the estimate of the variance of your model's Gaussian error. It is not known a priori, but it can be estimated from your residuals: the residual sum of squares divided by N - p, where p is the number of fitted parameters including the intercept.
Also, you need to add an intercept term to your data matrix. Scikit-learn does this automatically with the LinearRegression class (fit_intercept=True is the default). So to compute this yourself, you need to add a column of ones to your X matrix or dataframe.
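Written out as equations, with X taken to be the N x p data matrix that already includes the column of ones (so p counts the intercept), the quantities described above are roughly:

```latex
\hat{\beta} = (X^\top X)^{-1} X^\top y
\qquad
\hat{\sigma}^2 = \frac{(y - X\hat{\beta})^\top (y - X\hat{\beta})}{N - p}
\qquad
\widehat{\mathrm{Var}}(\hat{\beta}) = \hat{\sigma}^2 \, (X^\top X)^{-1}
\qquad
\mathrm{SE}(\hat{\beta}_i) = \sqrt{\widehat{\mathrm{Var}}(\hat{\beta})_{ii}}
```

The diagonal of the estimated covariance matrix gives the squared standard errors; the off-diagonal entries are the covariances between parameter estimates.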
how
Starting after your code,
show your scikit-learn results
print(model.intercept_)
print(model.coef_)
[-0.28671532]
[[ 0.17501115 -0.6928708 0.22336584]]
reproduce this with linear algebra
N = len(X)
p = len(X.columns) + 1 # plus one because LinearRegression adds an intercept term
X_with_intercept = np.empty(shape=(N, p), dtype=float)  # np.float was removed in NumPy 1.24; use the builtin float
X_with_intercept[:, 0] = 1
X_with_intercept[:, 1:p] = X.values
beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y.values
print(beta_hat)
[[-0.28671532]
[ 0.17501115]
[-0.6928708 ]
[ 0.22336584]]
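Forming the explicit inverse is fine for a small, well-conditioned problem like this one. As a hedged aside, np.linalg.lstsq solves the same least-squares problem without inverting X.T @ X. The sketch below uses its own seeded random data (the names rng, X_demo, y_demo and X1 are made up for illustration and are not part of the question), so its numbers will not match the output above.

```python
import numpy as np

# Illustrative data only: seeded so the run is repeatable.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((10, 3))
y_demo = rng.standard_normal((10, 1))

# Prepend a column of ones for the intercept.
X1 = np.column_stack([np.ones(len(X_demo)), X_demo])

# Normal equations, as in the answer.
beta_normal = np.linalg.inv(X1.T @ X1) @ X1.T @ y_demo

# Same fit via a least-squares solver (no explicit inverse).
beta_lstsq, _, rank, _ = np.linalg.lstsq(X1, y_demo, rcond=None)

# Both routes should agree up to floating-point error.
print(np.allclose(beta_normal, beta_lstsq))
```

The inverse of X.T @ X is still needed later for the covariance matrix, so this mainly matters when only the coefficients are wanted.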
compute standard errors of the parameter estimates
y_hat = model.predict(X)
residuals = y.values - y_hat
residual_sum_of_squares = residuals.T @ residuals
sigma_squared_hat = residual_sum_of_squares[0, 0] / (N - p)
var_beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) * sigma_squared_hat
for p_ in range(p):
    standard_error = var_beta_hat[p_, p_] ** 0.5
    print(f"SE(beta_hat[{p_}]): {standard_error}")
SE(beta_hat[0]): 0.2468580488280805
SE(beta_hat[1]): 0.2965501221823944
SE(beta_hat[2]): 0.3518847753610169
SE(beta_hat[3]): 0.3250760291745124
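The steps above can be bundled into one small function that returns the whole variance-covariance matrix (roughly what R's vcov() gives) along with the standard errors. This is a minimal sketch, not a scikit-learn API: ols_vcov is a hypothetical helper name, and it assumes X has full column rank and does not already contain a column of ones. It is checked here on a tiny dataset that can be worked by hand.

```python
import numpy as np

def ols_vcov(X, y):
    """Return (beta_hat, vcov, standard_errors) for OLS with an intercept.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)
    n = X.shape[0]
    # Add the intercept column that LinearRegression adds implicitly.
    X1 = np.column_stack([np.ones(n), X])
    p = X1.shape[1]
    xtx_inv = np.linalg.inv(X1.T @ X1)
    beta_hat = xtx_inv @ X1.T @ y
    residuals = y - X1 @ beta_hat
    # Unbiased estimate of the error variance: RSS / (N - p).
    sigma_squared_hat = residuals @ residuals / (n - p)
    vcov = xtx_inv * sigma_squared_hat
    return beta_hat, vcov, np.sqrt(np.diag(vcov))

# Tiny hand-checkable example: y regressed on a single x.
beta_hat, vcov, se = ols_vcov([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 2.0, 5.0])
print(beta_hat)
print(vcov)
print(se)
```

With the data from the question, calling ols_vcov(X.values, y.values) should reproduce the standard errors printed above.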
confirm with statsmodels
import statsmodels.api as sm
ols = sm.OLS(y.values, X_with_intercept)
ols_result = ols.fit()
print(ols_result.summary())  # print() so the table also shows outside a notebook
...
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2867      0.247     -1.161      0.290      -0.891       0.317
x1             0.1750      0.297      0.590      0.577      -0.551       0.901
x2            -0.6929      0.352     -1.969      0.096      -1.554       0.168
x3             0.2234      0.325      0.687      0.518      -0.572       1.019
==============================================================================
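The t column in the summary is just each coefficient divided by its standard error. A minimal check, using the numbers printed earlier (the variable names are only for illustration); the p-values and confidence intervals additionally need the t distribution with N - p degrees of freedom, which this sketch leaves out:

```python
import numpy as np

# Values copied from the outputs above.
beta_hat = np.array([-0.28671532, 0.17501115, -0.6928708, 0.22336584])
standard_errors = np.array([
    0.2468580488280805,
    0.2965501221823944,
    0.3518847753610169,
    0.3250760291745124,
])

# t statistic for the null hypothesis that a coefficient is zero.
t_stats = beta_hat / standard_errors
print(np.round(t_stats, 3))
```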
yay, done!