Python ARIMA exogenous variable out of sample

Question

I am trying to forecast a time series with the ARIMA class in the Python statsmodels package, including an exogenous variable, but I cannot figure out the correct way to pass the exogenous variable in the predict step. See here for docs.

import numpy as np
from scipy import stats
import pandas as pd

import statsmodels.api as sm

vals = np.random.rand(13)
ts = pd.TimeSeries(vals)
df = pd.DataFrame(ts, columns=["test"])
df.index = pd.Index(pd.date_range("2011/01/01", periods = len(vals), freq = 'Q'))

fit1 = sm.tsa.ARIMA(df, (1,0,0)).fit()
#this works fine:
pred1 = fit1.predict(start=12, end = 16)
print(pred1)

Out[32]: 
2014-03-31    0.589121
2014-06-30    0.747575
2014-09-30    0.631322
2014-12-31    0.654858
2015-03-31    0.650093
Freq: Q-DEC, dtype: float64

Now add a trend exogenous variable:

exogx = np.array(range(1,14))
#to make this easy, let's look at the ols of the trend (arima(0,0,0))
fit2 = sm.tsa.ARIMA(df, (0,0,0),exog = exogx).fit()
print(fit2.params)

const    0.555226
x1       0.013132
dtype: float64

print(fit2.fittedvalues)

2011-03-31    0.568358
2011-06-30    0.581490
2011-09-30    0.594622
2011-12-31    0.607754
2012-03-31    0.620886
2012-06-30    0.634018
2012-09-30    0.647150
2012-12-31    0.660282
2013-03-31    0.673414
2013-06-30    0.686546
2013-09-30    0.699678
2013-12-31    0.712810
2014-03-31    0.725942
Freq: Q-DEC, dtype: float64

Notice that, as we would expect, this is a trend line that increases by 0.013132 with each time step (this is random data, so the values will differ if you run it, but the story of a positive or negative trend will be the same). So the next value (for time = 14) should be 0.555226 + 0.013132*14 = 0.739074.
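As a quick check on that arithmetic, the sketch below computes the values a correct out-of-sample forecast would be expected to return for t = 14, 15, 16. It only uses the two coefficients printed by `fit2.params` in this particular run; the variable names are illustrative and it does not call statsmodels.

```python
# Coefficients copied from the fit2.params output above (they change on every run).
const = 0.555226
beta = 0.013132

# The in-sample trend covers t = 1..13, so the out-of-sample steps are t = 14, 15, 16.
expected = {t: round(const + beta * t, 6) for t in (14, 15, 16)}

for t, value in expected.items():
    print(t, value)
```

For this run the three values are 0.739074, 0.752206 and 0.765338.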

#out of sample exog should be (14,15,16)
pred2 = fit2.predict(start = 12, end = 16, exog = np.array(range(13,17)))
print(pred2)

2014-03-31    0.725942
2014-06-30    0.568358
2014-09-30    0.581490
2014-12-31    0.594622
2015-03-31    0.765338
Freq: Q-DEC, dtype: float64

So 2014-03-31 (the last in-sample point) is predicted correctly, but 2014-06-30 starts back at the beginning (t = 1). Notice also that 2015-03-31 (actually, always the last observation of the forecast, regardless of horizon) picks up t = 16, that is, (value - intercept)/beta = (0.765338 - 0.555226)/0.013132.
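The same inversion can be applied to every row of `pred2` to see which value of t each forecast actually used. This is only a sketch built from the numbers printed in this run; the dictionary and variable names are illustrative.

```python
# Coefficients and forecasts copied from the output above (specific to this run).
const = 0.555226
beta = 0.013132

pred2 = {
    "2014-03-31": 0.725942,
    "2014-06-30": 0.568358,
    "2014-09-30": 0.581490,
    "2014-12-31": 0.594622,
    "2015-03-31": 0.765338,
}

# Invert value = const + beta * t to recover the t implied by each forecast.
implied_t = {date: round((value - const) / beta) for date, value in pred2.items()}

for date, t in implied_t.items():
    print(date, t)
```

The implied sequence is 13, 1, 2, 3, 16 rather than 13, 14, 15, 16, 17, which is consistent with the in-sample exog being reused for the middle steps and only the final out-of-sample value being picked up.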

To make this clearer, notice what happens when I inflate the values of the exog matrix:

fit2.predict(start = 12, end = 16, exog = np.array(range(13,17))*10000)
Out[41]: 
2014-03-31       0.725942
2014-06-30       0.568358
2014-09-30       0.581490
2014-12-31       0.594622
2015-03-31    2101.680532
Freq: Q-DEC, dtype: float64

See that 2015-03-31 exploded, but none of the other exog values were considered? What am I doing wrong here?

I have tried passing in the exog variable every way I know (changing its dimension, making exog a matrix, making exog as long as the input plus the horizon, and so on). Any suggestions would be really appreciated.

I am using Python 2.7 from Anaconda 2.1 with numpy 1.8.1, scipy 0.14.0, pandas 0.14.0 and statsmodels 0.5.0, and have verified the issue on Windows 7 64-bit and CentOS 64-bit.

Also, a few things. I am using ARIMA for the ARIMA functionality, and the above is just for illustration (that is, I cannot "just use OLS...", as I imagine will be suggested). I also cannot "just use R" due to the restrictions of the project (and, more generally, the lack of support for R in base Spark).

Here are the interesting parts of the code all together, in case you want to try it yourself:

import numpy as np
from scipy import stats
import pandas as pd
import statsmodels.api as sm

vals = np.random.rand(13)
ts = pd.TimeSeries(vals)
df = pd.DataFrame(ts, columns=["test"])
df.index = pd.Index(pd.date_range("2011/01/01", periods = len(vals), freq = 'Q'))

exogx = np.array(range(1,14))
fit2 = sm.tsa.ARIMA(df, (0,0,0),exog = exogx).fit()
print(fit2.fittedvalues)
pred2 = fit2.predict(start = 12, end = 16, exog = np.array(range(13,17))*10000)
print(pred2)

Recommended answer

This is probably better posted on the GitHub issue tracker. I filed a ticket though.

It's best to file a ticket there; otherwise I might forget it. Quite busy these days.

There was a bug in the logic for the special case of k_ar == 0. It should be fixed now. Let me know whether you can give that patch a spin. If not, I can do some more rigorous testing and merge it.

Statsmodels on top of Spark? I'm intrigued.
