Python - Rolling window OLS Regression estimation

Question

For my evaluation, I have a dataset, available at this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), in the following format. The third column (Y) is the true value, which is what I want to predict (estimate).

 time     X   Y
0.000543  0  10
0.000575  0  10
0.041324  1  10
0.041331  2  10
0.041336  3  10
0.04134   4  10
  ...
9.987735  55 239
9.987739  56 239
9.987744  57 239
9.987749  58 239
9.987938  59 239

I want to run a rolling OLS regression estimation with a window of, for example, 5 observations, and I have tried it with the following script.

#!/usr/bin/python -tt

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('estimated_pred.csv')

model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']], 
                               window_type='rolling', window=5, intercept=True)
df['Y_hat'] = model.y_predict

print(df['Y_hat'])
print (model.summary)
df.plot.scatter(x='X', y='Y', s=0.1)

The summary of the regression analysis is shown below.

   -------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <X> + <intercept>

Number of Observations:         5
Number of Degrees of Freedom:   2

R-squared:           -inf
Adj R-squared:       -inf

Rmse:              0.0000

F-stat (1, 3):        nan, p-value:        nan

Degrees of Freedom: model 1, resid 3

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             X     0.0000     0.0000       1.97     0.1429     0.0000     0.0000
     intercept   239.0000     0.0000 14567091934632472.00     0.0000   239.0000   239.0000
---------------------------------End of Summary---------------------------------

I want to do a backward prediction of Y at t+1, i.e. predict the next value of Y from the previous values, p(Y)t+1, and compute the mean squared error (MSE). For example, if we look at row 5, the value of X is 2 and the value of Y is 10. Let's say the predicted value (p(Y)t+1) is 6; the squared error is then (10-6)^2. How can I do this using either statsmodels or scikit-learn, given that pd.stats.ols.MovingOLS was removed in pandas version 0.20.0 and I can't find any reference?

Recommended answer

Here is an outline of rolling OLS with statsmodels that should work for your data. It reads your file with df = pd.read_csv('estimated_pred.csv'); to test without the file, use the randomly generated df in the commented-out line instead:

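import pandas as pd
import numpy as np
import statsmodels.api as sm

# Random data for testing; uncomment this line to run without the CSV file
# df = pd.DataFrame(np.random.normal(size=(500, 3)), columns=['time', 'X', 'Y'])
df = pd.read_csv('estimated_pred.csv')
df = df.dropna()  # drop rows containing NaNs
window = 5

df['a'] = np.nan   # constant
df['b1'] = np.nan  # beta1 (coefficient on time)
df['b2'] = np.nan  # beta2 (coefficient on X)
for i in range(window, len(df)):
    temp = df.iloc[i - window:i, :]  # the `window` rows immediately before row i
    # has_constant='add' keeps the intercept column even if a regressor is constant in the window
    exog = sm.add_constant(temp.loc[:, ['time', 'X']], has_constant='add')
    RollOLS = sm.OLS(temp.loc[:, 'Y'], exog).fit()
    # params is a labelled Series (const, time, X), so index it by position with .iloc
    df.iloc[i, df.columns.get_loc('a')] = RollOLS.params.iloc[0]
    df.iloc[i, df.columns.get_loc('b1')] = RollOLS.params.iloc[1]
    df.iloc[i, df.columns.get_loc('b2')] = RollOLS.params.iloc[2]

# The following line gives you predicted values in a row, given the PRIOR row's estimated parameters
df['predicted'] = df['a'].shift(1) + df['b1'].shift(1) * df['time'] + df['b2'].shift(1) * df['X']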

I store the constant and betas, but there are several ways to approach prediction. You can use the fitted model object (mine is RollOLS) and its .predict() method, or multiply the terms out yourself, as I did in the final line. The latter is easier in this case because the number of variables is fixed and known, so simple column arithmetic does it all in one go.

To make predictions with statsmodels as you go, it would look like this:

predict_x=np.random.normal(size=(20,2))
RollOLS.predict(sm.add_constant(predict_x))

Keep in mind, though, that if you run the above code in sequence, the predicted values will use the model from the last window only. To use a different model, either save the fitted models as you go or predict values within the for loop. Note that you can also get fitted values from RollOLS.fittedvalues, so if you are smoothing the data, pull and save RollOLS.fittedvalues.iloc[-1] on each iteration of the loop.

To help you see how to use this with your own data, here is the tail of my df after running the rolling regression loop:

         time          X          Y           a         b1         b2
495  0.662463   0.771971   0.643008  -0.0235751   0.037875  0.0907694
496 -0.127879   1.293141   0.404959  0.00314073  0.0441054   0.113387
497 -0.006581  -0.824247   0.226653   0.0105847  0.0439867   0.118228
498  1.870858   0.920964   0.571535   0.0123463  0.0428359    0.11598
499  0.724296   0.537296  -0.411965  0.00104044   0.055003   0.118953
