Missing intercepts of OLS Regression models in Python statsmodels


Problem Description


I am running a rolling OLS regression estimation (with a window of, for example, 100) on the dataset found in this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), which is in the following format.

 time     X   Y
0.000543  0  10
0.000575  0  10
0.041324  1  10
0.041331  2  10
0.041336  3  10
0.04134   4  10
  ...
9.987735  55 239
9.987739  56 239
9.987744  57 239
9.987749  58 239
9.987938  59 239

The third column (Y) in my dataset is my true value - that's what I want to predict (estimate). I want to do a prediction of Y (i.e. predict the current value of Y according to the previous 3 rolling values of X). For this, I have the following Python script working using statsmodels.

# /usr/bin/python -tt
import pandas as pd
import numpy as np
import statsmodels.api as sm


df = pd.read_csv('estimated_pred.csv')
df = df.dropna()  # drop NaNs in case there are any
window = 100

df['a'] = None   # constant (intercept)
df['b1'] = None  # beta1 (coefficient on time)
df['b2'] = None  # beta2 (coefficient on X)
for i in range(window, len(df)):
    temp = df.iloc[i-window:i, :]
    RollOLS = sm.OLS(temp.loc[:, 'Y'],
                     sm.add_constant(temp.loc[:, ['time', 'X']], has_constant='add')).fit()
    df.iloc[i, df.columns.get_loc('a')] = RollOLS.params[0]
    df.iloc[i, df.columns.get_loc('b1')] = RollOLS.params[1]
    df.iloc[i, df.columns.get_loc('b2')] = RollOLS.params[2]

# Predict each row with the coefficients fitted on the previous window
df['predicted'] = df['a'].shift(1) + df['b1'].shift(1)*df['time'] + df['b2'].shift(1)*df['X']

print(df)  # DataFrame with fitted coefficients and predictions

This gives me a sample output of the following format.

         time   X   Y        a           b1           b2  predicted
0    0.000543   0  10     None         None         None       NaN
1    0.000575   0  10     None         None         None       NaN
2    0.041324   1  10     None         None         None       NaN
3    0.041331   2  10     None         None         None       NaN
4    0.041336   3  10     None         None         None       NaN
..        ...  ..  ..      ...          ...          ...       ...
50    0.041340   4  10       10            0  1.55431e-15       NaN
51    0.041345   5  10       10   1.7053e-13  7.77156e-16        10
52    0.041350   6  10       10  1.74623e-09 -7.99361e-15        10
53    0.041354   7  10       10  6.98492e-10 -6.21725e-15        10
..        ...  ..  ..      ...          ...          ...       ...
509  0.160835  38  20       20  4.88944e-09 -1.15463e-14        20
510  0.160839  39  20       20  1.86265e-09  5.32907e-15        20
..        ...  ..  ..      ...          ...          ...       ...

Finally, I want to include the mean squared error (MSE) of all the predictions, as part of a summary of the OLS regression analysis. For example, if we look at row 5, the value of X is 2 and the value of Y is 10; if the predicted value of Y at that row were 6, the squared error would be (10-6)^2. sm.OLS returns an instance of the class <class 'statsmodels.regression.linear_model.OLS'>, and print(RollOLS.summary()) shows the following.

OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                        -inf
Model:                            OLS   Adj. R-squared:                   -inf
Method:                 Least Squares   F-statistic:                    -48.50
Date:                Tue, 04 Jul 2017   Prob (F-statistic):               1.00
Time:                        22:19:18   Log-Likelihood:                 2359.7
No. Observations:                 100   AIC:                            -4713.
Df Residuals:                      97   BIC:                            -4706.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        239.0000   2.58e-09   9.26e+10      0.000       239.000   239.000
time        4.547e-13   2.58e-10      0.002      0.999     -5.12e-10  5.13e-10
X          -3.886e-16    1.1e-13     -0.004      0.997     -2.19e-13  2.19e-13
==============================================================================
Omnibus:                       44.322   Durbin-Watson:                   0.000
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               86.471
Skew:                          -1.886   Prob(JB):                     1.67e-19
Kurtosis:                       5.556   Cond. No.                     9.72e+04
==============================================================================

But the value of rsquared (print(RollOLS.rsquared)), for example, should be between 0 and 1 instead of -inf, and this seems to be the issue with the missing intercepts. If we want to print the MSE, we do print(RollOLS.mse_model), etc., as per the documentation. How can we add the intercepts and print the regression statistics with the correct values, as we do for the predicted values? What am I doing wrong here? Or is there another way of doing this using the scikit-learn library?
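
For reference, the MSE of the rolling predictions can be computed directly from the predicted column built above; this is a minimal sketch, not part of the original post, assuming df is the DataFrame produced by the script:

# Minimal sketch (assumes df comes from the script above and already
# contains the 'predicted' column).
valid = df['predicted'].notna()   # rows that actually have a prediction
err = df.loc[valid, 'Y'].astype(float) - df.loc[valid, 'predicted'].astype(float)
print('rolling prediction MSE:', (err ** 2).mean())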

Solution

Short Answer

The value of R^2 is going to be +/- inf as long as y remains constant over the regression window (100 observations in your case). You can find more details below, but the intuition is that R^2 is the proportion of y's variance explained by X: if y's variance is zero, R^2 is simply not well defined.

Possible solution: Try to use a longer window, or resample Y and X so that Y does not remain constant for so many consecutive observations.
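
For illustration, one way to resample along those lines is to keep only the first row of each run of identical Y values, so that a 100-row window spans 100 distinct Y levels. This is a hypothetical sketch, not from the original answer:

# Hypothetical sketch: thin the data so Y actually varies inside each window.
# Keep only the first row of every run of consecutive identical Y values.
changed = df['Y'].ne(df['Y'].shift())   # True where Y differs from the previous row
df_thin = df.loc[changed].reset_index(drop=True)
print(len(df), 'rows ->', len(df_thin), 'rows after thinning')
# df_thin can then be fed to the same rolling-OLS loop as before.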

Long Answer

Looking at this, I honestly think this is not the right dataset for the regression. This is a simple plot of the data (figure omitted here):

Does a linear combination of X and time explain Y? Mmm... it doesn't look plausible. Y almost looks like a discrete variable, so you probably want to look at logistic regression instead.
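
If one did go the classification route, a hypothetical sketch with scikit-learn (which the question also mentions) could treat each distinct Y value as a class:

# Hypothetical sketch: multinomial logistic regression on the same columns,
# treating each distinct Y value as a class label.
from sklearn.linear_model import LogisticRegression

X_feat = df[['time', 'X']].to_numpy()
y_cls = df['Y'].astype(int).to_numpy()
clf = LogisticRegression(max_iter=1000).fit(X_feat, y_cls)
print('in-sample accuracy:', clf.score(X_feat, y_cls))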

To come to your question: R^2 is "the proportion of the variance in the dependent variable that is predictable from the independent variable(s)". From Wikipedia:

R^2 = 1 - SS_res/SS_tot, where SS_res = sum_i (y_i - f_i)^2 is the residual sum of squares and SS_tot = sum_i (y_i - y_mean)^2 is the total sum of squares.

In your case it is very likely that Y is constant over the 100 observations, so it has zero variance; SS_tot = 0 produces a division by zero, hence the inf.
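
A tiny synthetic demonstration of this, not from the original answer:

# Synthetic demonstration: a constant y makes SS_tot = 0, so
# R^2 = 1 - SS_res/SS_tot involves a division by zero.
import numpy as np
import statsmodels.api as sm

y = np.full(100, 10.0)                       # zero-variance dependent variable
X = sm.add_constant(np.random.rand(100, 2))  # intercept + two random regressors
print(sm.OLS(y, X).fit().rsquared)           # nan (or -inf when residuals are tiny)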

So I am afraid you should not look for fixes in the code; instead, you should rethink the problem and the way you fit the data.
