Pandas/Statsmodels OLS predicting future values
Question
I've been trying to get a prediction for future values in a model I've created. I have tried both OLS in pandas and statsmodels. Here is what I have in statsmodels:
import statsmodels.api as sm
endog = pd.DataFrame(dframe['monthly_data_smoothed8'])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], dframe['date_delta']).fit()
sm_pred = smresults.predict(endog)
sm_pred
The length of the array returned is equal to the number of records in my original dataframe but the values are not the same. When I do the following using pandas I get no values returned.
from pandas.stats.api import ols
res1 = ols(y=dframe['monthly_data_smoothed8'], x=dframe['date_delta'])
res1.predict
(Note that there is no .fit function for OLS in pandas.) Could somebody shed some light on how I might get future predictions from my OLS model in either pandas or statsmodels? I realize I must not be using .predict properly, and I've read about the multiple other problems people have had, but they do not seem to apply to my case.
Edit: I believe 'endog' as defined is incorrect. I should be passing the values for which I want to predict, so I've created a date range of 12 periods past the last recorded value. But I am still missing something, as I am getting the error:
matrices are not aligned
Edit: here is a snippet of the data. The last column of numbers (date_delta) is the difference in months from the first date:
month monthly_data monthly_data_smoothed5 monthly_data_smoothed8 monthly_data_smoothed12 monthly_data_smoothed3 date_delta
0 2011-01-31 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 0.000000
1 2011-02-28 3.776706e+11 3.750759e+11 3.748327e+11 3.746975e+11 3.755084e+11 0.919937
2 2011-03-31 4.547079e+11 4.127964e+11 4.083554e+11 4.059256e+11 4.207653e+11 1.938438
3 2011-04-30 4.688370e+11 4.360748e+11 4.295531e+11 4.257843e+11 4.464035e+11 2.924085
Recommended answer
I think your issue here is that statsmodels doesn't add an intercept by default, so your model doesn't achieve much of a fit. To solve it in your code would be something like this:
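import pandas as pd
import statsmodels.api as sm

dframe = pd.read_clipboard()  # your sample data
dframe['intercept'] = 1  # constant column so that OLS estimates an intercept
X = dframe[['intercept', 'date_delta']]
y = dframe['monthly_data_smoothed8']
smresults = sm.OLS(y, X).fit()
dframe['pred'] = smresults.predict()  # in-sample fitted values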
Also, for what it's worth, I think the statsmodels formula API is much nicer to work with when dealing with DataFrames, and it adds an intercept by default (add - 1 to the formula to remove it). See below; it should give the same answer.
import statsmodels.formula.api as smf
smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()
dframe['pred'] = smresults.predict()
To predict future values, just pass new data to .predict(). For example, using the first model:
In [165]: smresults.predict(pd.DataFrame({'intercept': 1,
'date_delta': [0.5, 0.75, 1.0]}))
Out[165]: array([ 2.03927604e+11, 2.95182280e+11, 3.86436955e+11])
On the intercept: there's nothing encoded in the number 1; it's just based on the math of OLS (an intercept is perfectly analogous to a regressor that always equals 1), so you can pull the value right off the summary. Looking at the statsmodels docs, an alternative way to add an intercept would be:
X = sm.add_constant(X)
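To make the "regressor that always equals 1" point concrete, here is a minimal sketch using plain NumPy least squares rather than statsmodels. The data is made up (an exact line y = 5 + 2x), and the variable names are only illustrative. It shows that the coefficient on the column of ones is the intercept, and that leaving the column out forces the line through the origin, which distorts the slope.

```python
import numpy as np

# Made-up data lying exactly on y = 5 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 5.0 + 2.0 * x

# With a column of ones: the first coefficient is the intercept.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# Without the column of ones: the fit is forced through the origin.
coef0, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
slope_no_const = coef0[0]

print(intercept, slope, slope_no_const)
```

This is the same effect the original sm.OLS(y, dframe['date_delta']) call runs into: with no constant column, the model cannot represent the large baseline level of the series.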