Convert date to float for linear regression on Pandas data frame


Question

It seems that for OLS linear regression to work well in Pandas, the arguments must be floats. I'm starting with a csv (called "gameAct.csv") of the form:

date, city, players, sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-04-28,Paris,87,1001.33
...

I want to perform a linear regression of how sales depend on date (as time moves forward, how do sales move?). The problem with my code below seems to be that the dates are not float values. I would appreciate help on how to resolve this indexing problem in Pandas.

My current code (it runs without errors, but gives wrong results):

import pandas as pd
from pandas import DataFrame, Series
import statsmodels.formula.api as sm

df = pd.read_csv('gameAct.csv')
df.columns = ['date', 'city', 'players', 'sales']
city_data = df[df['city'] == 'London']
result = sm.ols(formula='sales ~ date', data=city_data).fit()

As I vary the city value, I get R^2 = 1 results, which is wrong. I have also tried passing index_col=0, parse_dates=True when defining the dataframe df, but without success.

I suspect there is a better way to read in such csv files to perform basic regression over dates, and also for more general time series analysis. Help, examples, and resources are appreciated!

Note that with the above code, if I convert the dates index (for a given city) to an array, the values in this array are of the form:

'\xef\xbb\xbf2014-04-28'

How does one produce an AIC analysis over all of the non-sales parameters? (For example, the result might be that sales depend most linearly on date and city.)

Recommended answer

For this kind of regression, I usually convert the dates or timestamps to an integer number of days since the start of the data.

This does the trick nicely:

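import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
# Days elapsed since the earliest date in the data, as a float
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1, 'D')
city_data = df[df['city'] == 'London']
result = sm.ols(formula='sales ~ date_delta', data=city_data).fit()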
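For readers who want to try the conversion without a csv file on disk, here is a minimal sketch that feeds the question's three sample rows through the same date_delta step. Two things are assumptions rather than part of the answer: the header is written without spaces, and numpy.polyfit stands in for the statsmodels OLS call purely as a lightweight way to see the slope.

```python
import io

import numpy as np
import pandas as pd

# Rows copied from the question's sample; the header has no spaces here.
csv_text = """date,city,players,sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-04-28,Paris,87,1001.33
"""

df = pd.read_csv(io.StringIO(csv_text))
df["date"] = pd.to_datetime(df["date"])

# Same conversion as in the answer: float days since the earliest date.
df["date_delta"] = (df["date"] - df["date"].min()) / np.timedelta64(1, "D")

london = df[df["city"] == "London"]

# Stand-in for OLS: a degree-1 least-squares fit of sales on date_delta.
slope, intercept = np.polyfit(
    london["date_delta"].to_numpy(), london["sales"].to_numpy(), deg=1
)

print(df[["date", "city", "date_delta"]])
print("slope (sales per day):", slope)
```

Because date_delta is measured in days, the slope reads directly as change in sales per day. With only two London rows the line passes through both points exactly, so this shows the units and nothing about fit quality.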

The advantage of this method is that you're sure of the units involved in the regression (days), whereas an automatic conversion may implicitly use other units, creating confusing coefficients in your linear model. It also allows you to combine data from multiple sales campaigns that started at different times into your regression (say you're interested in the effectiveness of a campaign as a function of days into the campaign). You could also pick Jan 1st as your 0 if you're interested in measuring the day-of-year trend. Picking your own 0 date puts you in control of all that.
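The three choices of zero date mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the answerer's code: the rows are made-up values, the city column stands in for a campaign identifier, and the column names days_since_start, days_into_city and day_of_year0 are hypothetical.

```python
import pandas as pd

# Made-up illustrative rows (not the contents of gameAct.csv).
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-04-28", "2014-04-29", "2014-04-30", "2014-05-02"]
    ),
    "city": ["London", "London", "Paris", "Paris"],
})

one_day = pd.Timedelta(days=1)

# Zero = earliest date in the whole data set.
df["days_since_start"] = (df["date"] - df["date"].min()) / one_day

# Zero = earliest date within each city ("days into the campaign").
city_start = df.groupby("city")["date"].transform("min")
df["days_into_city"] = (df["date"] - city_start) / one_day

# Zero = January 1st of the same year.
df["day_of_year0"] = df["date"].dt.dayofyear - 1

print(df)
```

Each column is expressed in days, so a regression coefficient on any of them has the same unit; only the meaning of "day 0" changes.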

There's also evidence that statsmodels supports time series from pandas. You may be able to apply this to linear models as well: http://statsmodels.sourceforge.net/stable/examples/generated/ex_dates.html

Also, a quick note: you should be able to read column names directly out of the csv automatically, as in the sample code I posted. In your example I see there are spaces after the commas in the first line of the csv file, resulting in column names like ' date'. Remove the spaces and automatic csv header reading should just work.
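If editing the file is not convenient, pandas.read_csv has options that may handle both issues at read time. The sketch below is an assumption-laden illustration, not part of the original answer: it builds a hypothetical in-memory file from the question's sample rows, and it treats the '\xef\xbb\xbf' prefix quoted in the question as a UTF-8 byte order mark, which is what those three bytes normally are.

```python
import io

import pandas as pd

# Hypothetical in-memory file: a UTF-8 BOM followed by the question's sample
# rows, with spaces after the commas in the header line.
raw = (
    b"\xef\xbb\xbfdate, city, players, sales\n"
    b"2014-04-28,London,111,1091.28\n"
    b"2014-04-29,London,100,1100.44\n"
    b"2014-04-28,Paris,87,1001.33\n"
)

df = pd.read_csv(
    io.BytesIO(raw),
    encoding="utf-8-sig",    # drops a leading BOM if one is present
    skipinitialspace=True,   # ignores spaces that follow a delimiter
    parse_dates=["date"],    # parses the date column as datetimes
)

print(df.columns.tolist())
print(df.dtypes)
```

With the header cleaned up this way there is no need to assign df.columns by hand, and the date column arrives as datetimes ready for the days-since-start conversion.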
