Missing intercepts of OLS Regression models in Python statsmodels
Problem description
I am running a rolling (for example, 100-observation window) OLS regression estimation on the dataset found in this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), which is in the following format.
time X Y
0.000543 0 10
0.000575 0 10
0.041324 1 10
0.041331 2 10
0.041336 3 10
0.04134 4 10
...
9.987735 55 239
9.987739 56 239
9.987744 57 239
9.987749 58 239
9.987938 59 239
The third column (Y) in my dataset is my true value - that's what I want to predict (estimate). I want to do a prediction of Y (i.e. predict the current value of Y according to the previous 3 rolling values of X). For this, I have the following Python script working using statsmodels.
# /usr/bin/python -tt
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.read_csv('estimated_pred.csv')
df = df.dropna()  # to drop NaNs in case there are any
window = 100
#print(df.index) # to print index
df['a'] = None   # constant (intercept)
df['b1'] = None  # beta1
df['b2'] = None  # beta2
for i in range(window, len(df)):
    temp = df.iloc[i - window:i, :]
    RollOLS = sm.OLS(temp.loc[:, 'Y'],
                     sm.add_constant(temp.loc[:, ['time', 'X']], has_constant='add')).fit()
    df.iloc[i, df.columns.get_loc('a')] = RollOLS.params[0]
    df.iloc[i, df.columns.get_loc('b1')] = RollOLS.params[1]
    df.iloc[i, df.columns.get_loc('b2')] = RollOLS.params[2]
# Predicted values in a row (using the coefficients from the previous window)
df['predicted'] = df['a'].shift(1) + df['b1'].shift(1)*df['time'] + df['b2'].shift(1)*df['X']
#print(df['predicted'])
print(temp)
Which gives me a sample output of the following format.
time X Y a b1 b2 predicted
0 0.000543 0 10 None None None NaN
1 0.000575 0 10 None None None NaN
2 0.041324 1 10 None None None NaN
3 0.041331 2 10 None None None NaN
4 0.041336 3 10 None None None NaN
.. ... .. .. ... ... ... ...
50 0.041340 4 10 10 0 1.55431e-15 NaN
51 0.041345 5 10 10 1.7053e-13 7.77156e-16 10
52 0.041350 6 10 10 1.74623e-09 -7.99361e-15 10
53 0.041354 7 10 10 6.98492e-10 -6.21725e-15 10
.. ... .. .. ... ... ... ...
509 0.160835 38 20 20 4.88944e-09 -1.15463e-14 20
510 0.160839 39 20 20 1.86265e-09 5.32907e-15 20
.. ... .. .. ... ... ... ...
Finally, I want to include the mean squared error (MSE) for all the predicted values (a summary of the OLS regression analysis). For example, if we look at row 5, the value of X is 2 and the value of Y is 10. Let's say the predicted value of y at the current row is 6, so the MSE would be (10-6)^2. sm.OLS returns an instance of the class <class 'statsmodels.regression.linear_model.OLS'>, and when we do print(RollOLS.summary()) we get:
OLS Regression Results
==============================================================================
Dep. Variable: Y R-squared: -inf
Model: OLS Adj. R-squared: -inf
Method: Least Squares F-statistic: -48.50
Date: Tue, 04 Jul 2017 Prob (F-statistic): 1.00
Time: 22:19:18 Log-Likelihood: 2359.7
No. Observations: 100 AIC: -4713.
Df Residuals: 97 BIC: -4706.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 239.0000 2.58e-09 9.26e+10 0.000 239.000 239.000
time 4.547e-13 2.58e-10 0.002 0.999 -5.12e-10 5.13e-10
X -3.886e-16 1.1e-13 -0.004 0.997 -2.19e-13 2.19e-13
==============================================================================
Omnibus: 44.322 Durbin-Watson: 0.000
Prob(Omnibus): 0.000 Jarque-Bera (JB): 86.471
Skew: -1.886 Prob(JB): 1.67e-19
Kurtosis: 5.556 Cond. No. 9.72e+04
==============================================================================
But the value of rsquared (print(RollOLS.rsquared)), for example, should have been between 0 and 1 instead of -inf, and this seems to be an issue with missing intercepts. If we want to print the mse, we do print(RollOLS.mse_model), etc., as per the documentation. How can we add the intercepts and print the regression statistics with the correct values, as we do for the predicted values? What am I doing wrong here? Or is there another way of doing this using the scikit-learn libraries?
Answer
Short answer
The value of r^2 is going to be +/- inf as long as y remains constant over the regression window (100 observations in your case). You can find more details below, but the intuition is that r^2 is the proportion of y's variance explained by X: if y's variance is zero, r^2 is simply not well defined.
Possible solution: Try to use a longer window, or resample Y and X so that Y does not remain constant for so many consecutive observations.
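A minimal pandas sketch of the resampling idea (the column names follow the estimated_pred.csv layout above; the toy values are made up): keep only the first row of each run of identical Y values, so every regression window sees variation in Y.

```python
import pandas as pd

# Toy frame mimicking the estimated_pred.csv layout (values are made up)
df = pd.DataFrame({
    "time": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "X":    [0, 1, 2, 3, 4, 5],
    "Y":    [10, 10, 10, 20, 20, 30],
})

# Keep only the first row of each run of identical Y values, so that
# consecutive Y values in the reduced frame always differ
changed = df["Y"].ne(df["Y"].shift())
reduced = df.loc[changed].reset_index(drop=True)
```

On the toy frame this keeps the rows where Y first becomes 10, 20 and 30, discarding the repeats in between.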
Long answer
Looking at this I honestly think this is not the right dataset for the regression. This is a simple plot of the data:
Does a linear combination of X and time explain Y? Mmm... it doesn't look plausible. Y almost looks like a discrete variable, so you probably want to look at logistic regression.
To come to your question: R^2 is "the proportion of the variance in the dependent variable that is predictable from the independent variable(s)". From Wikipedia, R^2 = 1 - SS_res / SS_tot, where SS_tot = Σ_i (y_i - ȳ)^2 is proportional to the variance of y.
In your case it is very likely that Y is constant over the 100 observations, hence it has zero variance, which produces a division by zero, hence the inf.
So I am afraid you should not look for fixes in the code; instead you should rethink the problem and the way of fitting the data.