Linear Regression from Time Series Pandas


Problem description

I would like to run a regression with a time series as the predictor, and I'm trying to follow the answer given in this SO question (OLS with pandas: datetime index as predictor), but as far as I can tell it no longer works.

Am I missing something, or is there a new way to do this?

import pandas as pd

rng = pd.date_range('1/1/2011', periods=4, freq='H')       
s = pd.Series(range(4), index = rng)                                                                      
z = s.reset_index()

pd.ols(x=z["index"], y=z[0]) 

I'm getting the error below. It is self-explanatory, but I'd like to know what I'm missing in reimplementing a solution that used to work.

TypeError: cannot astype a datetimelike from [datetime64[ns]] to [float64]

Recommended answer

I'm not sure why pd.ols is so picky here (it looks to me like you followed the example correctly). I suspect this is due to a change in how pandas handles or stores datetime indexes, but I haven't looked into it further. In any case, since your datetime values differ only in the hour, you can simply extract the hour with the dt accessor:

pd.ols(x=pd.to_datetime(z["index"]).dt.hour, y=z[0])
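pd.ols was deprecated and then removed from pandas (as of 0.20), so the call above will not run on a current install. As a rough sketch of the same idea, the block below extracts the hour with the dt accessor and fits a straight line with np.polyfit. Note that np.polyfit is a stand-in chosen here for illustration, not what the original answer used, and freq="h" replaces the older "H" alias, which recent pandas versions deprecate.

```python
import numpy as np
import pandas as pd

# Same toy data as in the question: four hourly timestamps, y = 0..3.
rng = pd.date_range("2011-01-01", periods=4, freq="h")
z = pd.Series(range(4), index=rng).reset_index()

# Use the hour of day as a plain numeric predictor.
x = z["index"].dt.hour.to_numpy(dtype=float)
y = z[0].to_numpy(dtype=float)

# Degree-1 least-squares fit; coefficients come back highest degree first.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)
```

This only recovers the coefficients. If you need the full summary table that pd.ols printed, a dedicated statistics package is the usual route.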

However, that gives you an r-squared of 1, because y is an exact linear function of x (both run 0, 1, 2, 3), so a line with an intercept fits the data perfectly. If you replace range with np.random.randn, you get something that looks like ordinary regression results.

In [6]: import numpy as np
        z = pd.Series(np.random.randn(4), index=rng).reset_index()
        pd.ols(x=pd.to_datetime(z["index"]).dt.hour, y=z[0])
Out[6]: 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         4
Number of Degrees of Freedom:   2

R-squared:         0.7743
Adj R-squared:     0.6615

Rmse:              0.5156

F-stat (1, 2):     6.8626, p-value:     0.1200

Degrees of Freedom: model 1, resid 2

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.6040     0.2306      -2.62     0.1200    -1.0560    -0.1521
     intercept     0.2915     0.4314       0.68     0.5689    -0.5540     1.1370
---------------------------------End of Summary---------------------------------
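To make the r-squared point concrete, here is a small sketch that computes R-squared by hand for both cases: the original y = 0..3, and random noise. The helper r_squared and the seeded generator are illustrative choices made here, not part of the original answer, and the noisy value will not match the table above because the random draws differ.

```python
import numpy as np
import pandas as pd

def r_squared(x, y):
    """R-squared of the least-squares line y ~ slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (slope * x + intercept)
    total = ((y - y.mean()) ** 2).sum()
    return 1.0 - resid.dot(resid) / total

rng = pd.date_range("2011-01-01", periods=4, freq="h")
hours = pd.Series(rng).dt.hour.to_numpy(dtype=float)

# y is exactly linear in the hour, so the residuals vanish.
r2_exact = r_squared(hours, np.arange(4, dtype=float))

# Random y (seeded for reproducibility): the fit is no longer perfect.
noise = np.random.default_rng(0).standard_normal(4)
r2_noisy = r_squared(hours, noise)

print(r2_exact, r2_noisy)
```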

Alternatively, you could convert the datetimes to integers. Using the raw integers didn't work well for me, presumably because they represent nanoseconds since the epoch and are therefore very large, which causes precision problems. Converting to integer and dividing by a trillion or so did work, and gave essentially the same results as dt.hour (i.e. the same r-squared):

pd.ols(x=pd.to_datetime(z["index"]).astype(int)/1e12, y=z[0])
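The reason the r-squared matches is that dividing by a constant only rescales x, and R-squared for a single predictor is the squared correlation, which does not change under a linear rescaling. A minimal check of that claim, assuming nanosecond resolution is forced explicitly and using np.corrcoef in place of the removed pd.ols:

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2011-01-01", periods=4, freq="h")
y = np.random.default_rng(0).standard_normal(4)
z = pd.Series(y, index=rng).reset_index()

# Force nanosecond resolution so the integers are nanoseconds since the epoch.
ts = z["index"].astype("datetime64[ns]")
ns = ts.astype("int64")

# Predictor 1: nanoseconds scaled down by 1e12, as in the answer.
x_scaled = ns.to_numpy(dtype=float) / 1e12
# Predictor 2: hour of day.
x_hour = ts.dt.hour.to_numpy(dtype=float)

# With one predictor and an intercept, R-squared equals corr(x, y) ** 2.
r2_scaled = np.corrcoef(x_scaled, y)[0, 1] ** 2
r2_hour = np.corrcoef(x_hour, y)[0, 1] ** 2

print(ns.iloc[0], r2_scaled, r2_hour)
```

The slope and intercept do change with the scaling (one unit of x_scaled is 1000 seconds rather than one hour), so only the fit quality is comparable between the two.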

Source of the error message

FWIW, it looks like the error message comes from something like this:

pd.to_datetime(z["index"]).astype(float)

A fairly obvious workaround is:

pd.to_datetime(z["index"]).astype(int).astype(float)
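The two conversions can be tried side by side. The sketch below wraps the direct cast in try/except, since whether it raises (and with what message) depends on the pandas version, and it spells the integer dtype as "int64" rather than int to avoid platform-dependent integer widths. Both are choices made here for illustration.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2011-01-01", periods=4, freq="h")
ts = pd.Series(rng).astype("datetime64[ns]")

# Direct datetime -> float cast: expected to raise TypeError.
try:
    ts.astype(float)
    direct_cast_error = None
except TypeError as exc:
    direct_cast_error = str(exc)

# Two-step workaround: datetime -> int64 nanoseconds -> float.
as_float = ts.astype("int64").astype(float)

print(direct_cast_error)
print(as_float.tolist())
```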
