Robustness issue of statsmodels linear regression (OLS) - Python
Problem description
I was testing some basic categorical regression using statsmodels. I built a deterministic model:
Y = X + Z
where X can take 3 values (a, b or c) and Z only 2 (d or e). At this stage the model is purely deterministic; I set the weight of each value as follows:
weight of a = 1
weight of b = 2
weight of c = 3
weight of d = 1
weight of e = 2
Therefore, with 1(X=a) being 1 if X=a and 0 otherwise, the model is simply:
Y = 1(X=a) + 2*1(X=b) + 3*1(X=c) + 1(Z=d) + 2*1(Z=e)
I used the following code to generate the different variables and run the regression:
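import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

nbData = 1000  # number of data points
rand1 = np.random.uniform(size=nbData)
rand2 = np.random.uniform(size=nbData)
# Dummy (0/1) columns for X: exactly one of a, b, c is 1 in each row
a = 1 * (rand1 <= (1.0/3.0))
b = 1 * (((1.0/3.0) < rand1) & (rand1 < (4/5.0)))
c = 1 - b - a
# Dummy (0/1) columns for Z: exactly one of d, e is 1 in each row
d = 1 * (rand2 <= (3.0/5.0))
e = 1 - d
weights = [1, 2, 3, 1, 2]  # the stated weights (not used below)
# Note: d and e enter with 4 and 5 here, not the stated 1 and 2;
# the difference between them is still 1
y = a + 2*b + 3*c + 4*d + 5*e
df = pd.DataFrame({'y': y, 'a': a, 'b': b, 'c': c, 'd': d, 'e': e})
# "- 1" removes the intercept from the formula
mod = ols(formula='y ~ a + b + c + d + e - 1', data=df)
res = mod.fit()
print(res.summary())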
I end up with the right results (one has to look at the differences between the coefficients rather than the coefficients themselves):
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.006e+30
Date: Wed, 16 Sep 2015 Prob (F-statistic): 0.00
Time: 03:05:40 Log-Likelihood: 3156.8
No. Observations: 100 AIC: -6306.
Df Residuals: 96 BIC: -6295.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
a 1.6000 7.47e-16 2.14e+15 0.000 1.600 1.600
b 2.6000 6.11e-16 4.25e+15 0.000 2.600 2.600
c 3.6000 9.61e-16 3.74e+15 0.000 3.600 3.600
d 3.4000 5.21e-16 6.52e+15 0.000 3.400 3.400
e 4.4000 6.85e-16 6.42e+15 0.000 4.400 4.400
==============================================================================
Omnibus: 11.299 Durbin-Watson: 0.833
Prob(Omnibus): 0.004 Jarque-Bera (JB): 5.720
Skew: -0.381 Prob(JB): 0.0573
Kurtosis: 2.110 Cond. No. 2.46e+15
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.67e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
But when I increase the number of data points to (say) 600, the regression produces really bad results. I have tried similar regressions in Excel and in R, and they produce consistent results whatever the number of data points. Does anyone know whether there is some restriction in statsmodels OLS that explains this behaviour, or am I missing something?
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.167
Model: OLS Adj. R-squared: 0.161
Method: Least Squares F-statistic: 29.83
Date: Wed, 16 Sep 2015 Prob (F-statistic): 1.23e-22
Time: 03:08:04 Log-Likelihood: -701.02
No. Observations: 600 AIC: 1412.
Df Residuals: 595 BIC: 1434.
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
a 5.8070 1.15e+13 5.05e-13 1.000 -2.26e+13 2.26e+13
b 6.4951 1.15e+13 5.65e-13 1.000 -2.26e+13 2.26e+13
c 6.9033 1.15e+13 6.01e-13 1.000 -2.26e+13 2.26e+13
d -1.1927 1.15e+13 -1.04e-13 1.000 -2.26e+13 2.26e+13
e -0.1685 1.15e+13 -1.47e-14 1.000 -2.26e+13 2.26e+13
==============================================================================
Omnibus: 67.153 Durbin-Watson: 0.328
Prob(Omnibus): 0.000 Jarque-Bera (JB): 70.964
Skew: 0.791 Prob(JB): 3.89e-16
Kurtosis: 2.419 Cond. No. 7.70e+14
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.25e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
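The singular-design warning can be checked without statsmodels. The sketch below is a minimal NumPy illustration, not the statsmodels implementation; the fixed seed and the names `rng` and `design` are assumptions added for the example. Since a + b + c = 1 and d + e = 1 in every row, the five dummy columns are linearly dependent, so only the differences between coefficients within each group are determined by the data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
n = 1000
rand1 = rng.uniform(size=n)
rand2 = rng.uniform(size=n)

# Same dummy construction as in the question
a = (rand1 <= 1/3).astype(float)
b = ((rand1 > 1/3) & (rand1 < 4/5)).astype(float)
c = 1 - a - b
d = (rand2 <= 3/5).astype(float)
e = 1 - d
y = a + 2*b + 3*c + 4*d + 5*e

# Design matrix with all five dummies and no intercept
design = np.column_stack([a, b, c, d, e])
rank = np.linalg.matrix_rank(design)
print("columns:", design.shape[1], "rank:", rank)

# Minimum-norm least-squares solution of the rank-deficient system
coef, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
print("minimum-norm coefficients:", coef)
print("b - a:", coef[1] - coef[0])
print("c - b:", coef[2] - coef[1])
print("e - d:", coef[4] - coef[3])
```

Under these assumptions the rank is 4 rather than 5, and the minimum-norm coefficients should come out close to 1.6, 2.6, 3.6, 3.4 and 4.4, the values in the first summary above. Adding any multiple of (1, 1, 1, -1, -1) to them fits the data equally well, which is why only the differences are meaningful.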
Recommended answer
It appears that, as mentioned by Mr. F, the main problem is that statsmodels OLS does not seem to handle the collinearity problem as well as Excel/R in this case. But if, instead of defining one variable for each of a, b, c, d and e, one defines a variable X and a variable Z, which can be equal to a, b or c and to d or e respectively, then the regression works fine. I.e. updating the code with:
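# Build one categorical column per variable; .loc avoids chained assignment
df['X'] = ['c'] * len(df)
df.loc[df.b != 0, 'X'] = 'b'
df.loc[df.a != 0, 'X'] = 'a'
df['Z'] = ['e'] * len(df)
df.loc[df.d != 0, 'Z'] = 'd'
mod = ols(formula='y ~ X + Z - 1', data=df)
res = mod.fit()
print(res.summary())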
which leads to the expected result:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 2.684e+27
Date: Thu, 17 Sep 2015 Prob (F-statistic): 0.00
Time: 06:22:43 Log-Likelihood: 2.5096e+06
No. Observations: 100000 AIC: -5.019e+06
Df Residuals: 99996 BIC: -5.019e+06
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
X[a] 5.0000 1.85e-14 2.7e+14 0.000 5.000 5.000
X[b] 6.0000 1.62e-14 3.71e+14 0.000 6.000 6.000
X[c] 7.0000 2.31e-14 3.04e+14 0.000 7.000 7.000
Z[T.e] 1.0000 1.97e-14 5.08e+13 0.000 1.000 1.000
==============================================================================
Omnibus: 145.367 Durbin-Watson: 1.353
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9729.487
Skew: -0.094 Prob(JB): 0.00
Kurtosis: 1.483 Cond. No. 2.29
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
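A likely reason the X/Z formulation behaves better is visible in the table above: the formula interface appears to encode Z with one dummy fewer (only `Z[T.e]` is listed), which removes the redundant column. The following NumPy-only sketch mimics that reduced design under the same data-generating assumptions as the question's code; the fixed seed and the names `rng` and `reduced` are illustrative and not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
n = 1000
rand1 = rng.uniform(size=n)
rand2 = rng.uniform(size=n)

# Same dummy construction as in the question
a = (rand1 <= 1/3).astype(float)
b = ((rand1 > 1/3) & (rand1 < 4/5)).astype(float)
c = 1 - a - b
d = (rand2 <= 3/5).astype(float)
e = 1 - d
y = a + 2*b + 3*c + 4*d + 5*e

# Keep all three X dummies but drop d, so d becomes the reference level of Z
reduced = np.column_stack([a, b, c, e])
reduced_rank = np.linalg.matrix_rank(reduced)
print("columns:", reduced.shape[1], "rank:", reduced_rank)

# With full column rank the least-squares solution is unique
coef, _, _, _ = np.linalg.lstsq(reduced, y, rcond=None)
for name, value in zip(["X[a]", "X[b]", "X[c]", "Z[T.e]"], coef):
    print(name, round(float(value), 6))
```

Here the X coefficients absorb the weight of the reference level d (1 + 4, 2 + 4, 3 + 4), and the remaining coefficient is the difference e - d, which is consistent with the 5, 6, 7 and 1 reported in the summary.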