Missing intercepts of OLS regression models in Python statsmodels


Problem description

I am running a rolling (for example, 100-window) OLS regression estimation on the dataset found in this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), which has the following format.

 time     X   Y
0.000543  0  10
0.000575  0  10
0.041324  1  10
0.041331  2  10
0.041336  3  10
0.04134   4  10
  ...
9.987735  55 239
9.987739  56 239
9.987744  57 239
9.987749  58 239
9.987938  59 239

The third column (Y) in my dataset is my true value - that's what I want to predict (estimate). I want to do a prediction of Y (i.e. predict the current value of Y according to the previous rolling values of X). For this, I have the following Python script using statsmodels.

# /usr/bin/python -tt
import pandas as pd
import numpy as np
import statsmodels.api as sm


df = pd.read_csv('estimated_pred.csv')
df = df.dropna()  # drop NaNs in case there are any
window = 100

df['a'] = None   # constant (intercept)
df['b1'] = None  # beta1 (time)
df['b2'] = None  # beta2 (X)
for i in range(window, len(df)):
    temp = df.iloc[i - window:i, :]
    RollOLS = sm.OLS(temp.loc[:, 'Y'],
                     sm.add_constant(temp.loc[:, ['time', 'X']], has_constant='add')).fit()
    df.iloc[i, df.columns.get_loc('a')] = RollOLS.params.iloc[0]
    df.iloc[i, df.columns.get_loc('b1')] = RollOLS.params.iloc[1]
    df.iloc[i, df.columns.get_loc('b2')] = RollOLS.params.iloc[2]

# Predicted values, using the previous row's fitted coefficients
df['predicted'] = df['a'].shift(1) + df['b1'].shift(1) * df['time'] + df['b2'].shift(1) * df['X']

print(df)

This gives me a sample output in the following format.

         time   X   Y        a           b1           b2  predicted
0    0.000543   0  10     None         None         None       NaN
1    0.000575   0  10     None         None         None       NaN
2    0.041324   1  10     None         None         None       NaN
3    0.041331   2  10     None         None         None       NaN
4    0.041336   3  10     None         None         None       NaN
..        ...  ..  ..      ...          ...          ...       ...
50    0.041340   4  10       10            0  1.55431e-15       NaN
51    0.041345   5  10       10   1.7053e-13  7.77156e-16        10
52    0.041350   6  10       10  1.74623e-09 -7.99361e-15        10
53    0.041354   7  10       10  6.98492e-10 -6.21725e-15        10
..        ...  ..  ..      ...          ...          ...       ...
509  0.160835  38  20       20  4.88944e-09 -1.15463e-14        20
510  0.160839  39  20       20  1.86265e-09  5.32907e-15        20
..        ...  ..  ..      ...          ...          ...       ...

Finally, I want to include the mean squared error (MSE) for all the predicted values (a summary of the OLS regression analysis). For example, if we look at row 5, the value of X is 2 and the value of Y is 10. Say the predicted value of y at that row is 6; the squared error would then be (10 - 6)^2. sm.OLS returns an instance of <class 'statsmodels.regression.linear_model.OLS'>, and print(RollOLS.summary()) gives:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                        -inf
Model:                            OLS   Adj. R-squared:                   -inf
Method:                 Least Squares   F-statistic:                    -48.50
Date:                Tue, 04 Jul 2017   Prob (F-statistic):               1.00
Time:                        22:19:18   Log-Likelihood:                 2359.7
No. Observations:                 100   AIC:                            -4713.
Df Residuals:                      97   BIC:                            -4706.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        239.0000   2.58e-09   9.26e+10      0.000       239.000   239.000
time        4.547e-13   2.58e-10      0.002      0.999     -5.12e-10  5.13e-10
X          -3.886e-16    1.1e-13     -0.004      0.997     -2.19e-13  2.19e-13
==============================================================================
Omnibus:                       44.322   Durbin-Watson:                   0.000
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               86.471
Skew:                          -1.886   Prob(JB):                     1.67e-19
Kurtosis:                       5.556   Cond. No.                     9.72e+04
==============================================================================

But the value of rsquared (print(RollOLS.rsquared)) should be between 0 and 1, not -inf, and this seems to be the issue with the missing intercepts. If we want to print the mse, we do print(RollOLS.mse_model), etc., as per the documentation. How can we add the intercepts and print the regression statistics with the correct values, as we do for the predicted values? What am I doing wrong here? Or is there another way of doing this using scikit-learn?
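As for the out-of-sample MSE of the predictions themselves (as opposed to the in-sample `mse_model` of each window's fit), it can be computed directly from the `Y` and `predicted` columns, skipping the rows that have no prediction yet. A minimal sketch with made-up numbers in place of the real frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the question's layout: true Y and shifted predictions
df = pd.DataFrame({'Y':         [10.0, 10.0, 20.0, 20.0],
                   'predicted': [np.nan, 10.0, 16.0, 20.0]})

# Mean squared error over the rows where a prediction exists
valid = df['predicted'].notna()
mse = ((df.loc[valid, 'Y'] - df.loc[valid, 'predicted']) ** 2).mean()
```

With these numbers the squared errors are 0, 16 and 0, so `mse` is 16/3.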

Answer

Short answer

The value of r^2 is going to be +/- inf as long as y remains constant over the regression window (100 observations in your case). You can find more details below, but the intuition is that r^2 is the proportion of y's variance explained by X: if y's variance is zero, r^2 is simply not well defined.

Possible solution: try a longer window, or resample Y and X so that Y does not remain constant for so many consecutive observations.

Long answer

Looking at this, I honestly think this is not the right dataset for the regression. This is a simple plot of the data:

Does a linear combination of X and time explain Y? Mmm... it doesn't look plausible. Y almost looks like a discrete variable, so you probably want to look at logistic regression instead.

To come to your question: R^2 is "the proportion of the variance in the dependent variable that is predictable from the independent variable(s)". From Wikipedia:
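In the usual notation, that definition reads:

```latex
R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}
    = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

When y is constant over the window, the denominator \(\sum_i (y_i - \bar{y})^2\) is zero, so the ratio, and with it R^2, is undefined.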

In your case it is very likely that Y is constant over the 100 observations of a window, so it has zero variance, which produces a division by zero, hence the inf.

So I am afraid you should not look for fixes in the code; instead, you should rethink the problem and the way you fit the data.

