Use scikit-learn to do linear regression on a time series pandas DataFrame


Question

I'm trying to do a simple linear regression on a pandas DataFrame using scikit-learn's LinearRegression. My data is a time series, and the DataFrame has a datetime index:

                value
2007-01-01    0.771305
2007-02-01    0.256628
2008-01-01    0.670920
2008-02-01    0.098047

Doing something simple like

from sklearn import linear_model

lr = linear_model.LinearRegression()

lr.fit(data.index, data['value'])

doesn't work:

float() argument must be a string or a number

So I tried to create a new column from the dates and convert it:

data['date'] = data.index
data['date'] = pd.to_datetime(data['date'])
lr.fit(data['date'], data['value'])

But now I get:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

So the regressor can't handle datetime values. I have seen many ways to convert integer data to datetime, but I couldn't find a way to go the other direction, from datetime to integer.

What is the proper way to do this?

PS: I'm interested in using scikit-learn because I'm planning on doing more with it later, so no statsmodels for now.

Recommended answer

You probably want something like the number of days since the start as your predictor here. Assuming everything is sorted:

In [36]: X = (df.index - df.index[0]).days.values.reshape(-1, 1)

In [37]: y = df['value'].values

In [38]: linear_model.LinearRegression().fit(X, y)
Out[38]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
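For reference, here is a self-contained sketch of the same idea that can be run as a script. It assumes the four sample rows from the question as the data and a reasonably recent pandas, where the result of `.days` is an Index and has to be converted to a NumPy array before reshaping. The variable name `model` is only illustrative.

```python
import pandas as pd
from sklearn import linear_model

# Sample rows copied from the question; `df` is the name used in the answer.
df = pd.DataFrame(
    {"value": [0.771305, 0.256628, 0.670920, 0.098047]},
    index=pd.to_datetime(
        ["2007-01-01", "2007-02-01", "2008-01-01", "2008-02-01"]
    ),
)

# Days elapsed since the first observation, shaped (n_samples, 1)
# because scikit-learn expects a 2-D feature matrix.
X = (df.index - df.index[0]).days.to_numpy().reshape(-1, 1)
y = df["value"].to_numpy()

model = linear_model.LinearRegression().fit(X, y)

print(X.ravel())  # days since start: 0, 31, 365, 396
print(model.coef_[0], model.intercept_)  # slope per day and intercept
```

The slope is expressed per day here, since that is the unit of the predictor.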

The exact units you use for the predictor don't really matter; it could be days or months. The coefficients and their interpretation change, but the fitted values work out the same (exactly so when one unit is a constant multiple of the other, such as days and weeks; calendar months vary in length, so they agree only approximately). Also, notice that we needed reshape(-1, 1) so that X has the 2-D shape (n_samples, n_features) that scikit-learn expects.
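A small sketch of the point about units, assuming the sample rows from the question: fitting on days and on weeks (days divided by 7) gives identical fitted values, and only the slope is rescaled. The names `fit_days`, `fit_weeks` and the date `2008-03-01` are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample rows copied from the question.
df = pd.DataFrame(
    {"value": [0.771305, 0.256628, 0.670920, 0.098047]},
    index=pd.to_datetime(
        ["2007-01-01", "2007-02-01", "2008-01-01", "2008-02-01"]
    ),
)
y = df["value"].to_numpy()

# Same predictor in two units: days and weeks since the first observation.
days = (df.index - df.index[0]).days.to_numpy().reshape(-1, 1)
weeks = days / 7.0

fit_days = LinearRegression().fit(days, y)
fit_weeks = LinearRegression().fit(weeks, y)

# The fitted values agree; the slope per week is 7 times the slope per day.
same_predictions = np.allclose(fit_days.predict(days), fit_weeks.predict(weeks))
slope_ratio = fit_weeks.coef_[0] / fit_days.coef_[0]
print(same_predictions, slope_ratio)

# Predicting for a new (hypothetical) date uses the same conversion.
new_date = pd.Timestamp("2008-03-01")
new_X = np.array([[(new_date - df.index[0]).days]])
prediction = fit_days.predict(new_X)[0]
print(prediction)
```

Whichever unit is chosen, new dates have to be converted with the same origin and unit that were used for fitting.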
