时间序列的分解():ValueError:您必须指定一个周期或x必须是一个带有DatetimeIndex且频率未设置为None的pandas对象 [英] decompose() for time series: ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None

查看:800
本文介绍了时间序列的分解():ValueError:您必须指定一个周期或x必须是一个带有DatetimeIndex且频率未设置为None的pandas对象的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在正确执行加法模型时遇到一些问题.

我有那个数据框:

当我运行这段代码时:

 将 statsmodels 导入为 sm将 statsmodels.api 导入为 sm分解 = sm.tsa.seasonal_decompose(df,模型 = 'additive')图 = 分解.plot()matplotlib.rcParams['figure.figsize'] = [9.0,5.0]

我收到了这条消息:

ValueError: 您必须指定一个句点或 x 必须是一个 Pandas 对象,其 DatetimeIndex 的频率未设置为 None

我应该怎么做才能得到那个例子:

我从这个地方截取的上面的屏幕

period = 2:

result_add =season_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=2)plt.rcParams.update({'figure.figsize': (5,5)})result_add.plot().suptitle('Additive Decompose', fontsize=22)plt.show()

如果你将所有项目的四分之一作为一个周期,这里是 4 个(共 16 个项目).

period = 4:

result_add =season_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=int(len(df_test)/4))plt.rcParams.update({'figure.figsize': (5,5)})result_add.plot().suptitle('Additive Decompose', fontsize=22)plt.show()

或者,如果您在此处采用循环的最大可能大小,即 8 个(共 16 个项目).

period = 8:

result_add =season_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=int(len(df_test)/2))plt.rcParams.update({'figure.figsize': (5,5)})result_add.plot().suptitle('Additive Decompose', fontsize=22)plt.show()

看看 y 轴如何改变它们的比例.

####

您将根据需要增加周期整数.您的问题的最大值:

sm.tsa.seasonal_decompose(df, model = 'additive', period = int(len(df)/2))

2.的详细信息:

要让 x 成为一个频率未设置为 None 的 DatetimeIndex,您需要使用 .asfreq('?') 和 ?您可以从 https 的各种偏移别名中进行选择://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.

在您的情况下,此选项 2. 更适合您,因为您似乎有一个没有空白的列表.您的每月数据可能应该作为月份开始频率"引入.-->MS"作为偏移别名:

sm.tsa.seasonal_decompose(df.asfreq('MS'), model = 'additive')

参见 如何使用 pd.to_datetime 设置频率()? 了解更多详情,以及您将如何处理差距.

如果您的数据在时间上高度分散,以至于有太多空白需要填补,或者如果时间空白无关紧要,则使用句点"选项 1可能是更好的选择.

在我的 df_test 示例中,选项 2. 不好.数据在时间上完全分散,如果我以一分钟为频率,你会得到:

df_test.asfreq('s') 的输出(=频率以秒为单位):

2016-05-04 08:53:20 12016-05-04 08:53:21 22016-05-04 08:53:22 12016-05-04 08:53:23 92016-05-04 08:53:24 NaN...2016-05-04 08:58:19 NaN2016-05-04 08:58:20 12016-05-04 08:58:21 12016-05-04 08:58:22 32016-05-04 08:58:23 9频率:S,名称:listData,长度:304,dtype:对象

您在这里看到,虽然我的数据只有 16 行,但引入以秒为单位的频率会迫使 df 为 304 行,仅从08:53:20"开始.直到08:58:23",这里产生了 288 个间隙.更重要的是,在这里你必须打准确的时间.如果您将 0.1 甚至 0.12314 秒作为您的实际频率,则您将不会命中大多数带有索引的项目.

这里是一个以 min 作为偏移别名的例子,df_test.asfreq('min'):

2016-05-04 08:53:20 12016-05-04 08:54:20 NaN2016-05-04 08:55:20 NaN2016-05-04 08:56:20 NaN2016-05-04 08:57:20 NaN2016-05-04 08:58:20 1

我们看到只有第一分钟和最后一分钟被填满,其余的没有被击中.

以天为偏移别名,df_test.asfreq('d'):

2016-05-04 08:53:20 1

我们看到您只得到第一行作为结果 df,因为只有一天覆盖.它会给你找到的第一个项目,其余的被丢弃.

一切都结束了:

将所有这些放在一起,在您的情况下,选择选项 2.而在我的 df_test 示例中,需要选项 1.

have some problem to execute an additive model right.

I have that data frame:

And when I run this code:

import statsmodels as sm
import statsmodels.api as sm
decomposition = sm.tsa.seasonal_decompose(df, model = 'additive')
fig = decomposition.plot()
matplotlib.rcParams['figure.figsize'] = [9.0,5.0]

I got that message:

ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None

What should I do in order to get that example:

The screen above I took from this place https://towardsdatascience.com/analyzing-time-series-data-in-pandas-be3887fdd621

解决方案

Having the same ValueError, this is just the result of some testing and little research on my own, without the claim to be complete or professional about it. Please comment or answer whoever finds something wrong.

Of course, your data should be in the right order of the index values, which you would assure with df.sort_index(inplace=True), as you state it in your answer. This is not wrong as such, though the error message is not about the sort order, and I have checked this: the error does not go away in my case when I sort the index of a huge dataset I have at hand. It is true, I also have to sort the df.index, but the decompose() can handle unsorted data as well where items jump here and there in time: then you simply get a lot of blue lines from left to the right and back, until the whole graph is full of it. What is more, usually, the sorting is already in the right order anyway. In my case, sorting does not help fixing the error. Thus I also doubt that index sorting has fixed the error in your case, because: what does the error actually say?

ValueError: You must specify:

  1. [either] a period
  2. or x must be a pandas object with a DatetimeIndex with a freq not set to None

Before all, in case you have a list column so that your time series is nested up to now, see Convert pandas df with data in a "list column" into a time series in long format. Use three columns: [list of data] + [timestamp] + [duration] for details how to unnest a list column. This would be needed for both 1.) and 2.).

Details of 1.:

Definition of period

"period, int, optional" from https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html:

Period of the series. Must be used if x is not a pandas object or if the index of x does not have a frequency. Overrides default periodicity of x if x is a pandas object with a timeseries index.

The period parameter that is set with an integer means the number of cycles which you expect to be in the data. If you have a df with 1000 rows with a list column in it (call it df_nested), and each list with for example 100 elements, then you will have 100 elements per cycle. It is probably smart taking period = len(df_nested) (= number of cycles) in order to get the best split of seasonality and trend. If your elements per cycle vary over time, other values may be better. I am not sure about how to rightly set the parameter, therefore the question statsmodels seasonal_decompose(): What is the right "period of the series" in the context of a list column (constant vs. varying number of items) on Cross Validated which is not yet answered.

The "period" parameter of option 1.) has a big advantage over option 2.). Though it uses the time index (DatetimeIndex) for its x-axis, it does not require an item to hit the frequency exactly, in contrast to option 2.). Instead, it just joins together whatever is in a row, with the advantage that you do not need to fill any gaps: the last value of the previous event is just joined with the next value of the following event, whether it is already in the next second or on the next day.

What is the max possible "period" value? In case you have a list column (call the df "df_nested" again), you should first unnest the list column to a normal column. The max period is len(df_unnested)/2.

Example1: 20 items in x (x is the amount of all items of df_unnested) can maximally have a period = 10.

Example2: Having the 20 items and taking period=20 instead, this throws the following error:

ValueError: x must have 2 complete cycles requires 40 observations. x only has 20 observation(s)

Another side-note: To get rid of the error in question, period = 1 should already take it away, but for time series analysis, "=1" does not reveal anything new, every cycle is just 1 item then, the trend is the same as the original data, the seasonality is 0, and the residuals are always 0.

####

Example borrowed from Convert pandas df with data in a "list column" into a time series in long format. Use three columns: [list of data] + [timestamp] + [duration]

df_test = pd.DataFrame({'timestamp': [1462352000000000000, 1462352100000000000, 1462352200000000000, 1462352300000000000],
                'listData': [[1,2,1,9], [2,2,3,0], [1,3,3,0], [1,1,3,9]],
                'duration_sec': [3.0, 3.0, 3.0, 3.0]})
tdi = pd.DatetimeIndex(df_test.timestamp)
df_test.set_index(tdi, inplace=True)
df_test.drop(columns='timestamp', inplace=True)
df_test.index.name = 'datetimeindex'

df_test = df_test.explode('listData') 
sizes = df_test.groupby(level=0)['listData'].transform('size').sub(1)
duration = df_test['duration_sec'].div(sizes)
df_test.index += pd.to_timedelta(df_test.groupby(level=0).cumcount() * duration, unit='s') 

The resulting df_test['listData'] looks as follows:

2016-05-04 08:53:20    1
2016-05-04 08:53:21    2
2016-05-04 08:53:22    1
2016-05-04 08:53:23    9
2016-05-04 08:55:00    2
2016-05-04 08:55:01    2
2016-05-04 08:55:02    3
2016-05-04 08:55:03    0
2016-05-04 08:56:40    1
2016-05-04 08:56:41    3
2016-05-04 08:56:42    3
2016-05-04 08:56:43    0
2016-05-04 08:58:20    1
2016-05-04 08:58:21    1
2016-05-04 08:58:22    3
2016-05-04 08:58:23    9

Now have a look at different period's integer values.

period = 1:

result_add = seasonal_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=1)
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

period = 2:

result_add = seasonal_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=2)
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

If you take a quarter of all items as one cycle which is 4 (out of 16 items) here.

period = 4:

result_add = seasonal_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=int(len(df_test)/4))
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

Or if you take the max possible size of a cycle which is 8 (out of 16 items) here.

period = 8:

result_add = seasonal_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=int(len(df_test)/2))
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

Have a look at how the y-axes change their scale.

####

You will increase the period integer according to your needs. The max in your case of the question:

sm.tsa.seasonal_decompose(df, model = 'additive', period = int(len(df)/2))

Details of 2.:

To get x to be a DatetimeIndex with a freq not set to None, you need to assign the freq of the DatetimeIndex using .asfreq('?') with ? being your choice among a wide range of offset aliases from https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.

In your case, this option 2. is the better suited as you seem to have a list without gaps. Your monthly data then should probably be introduced as "month start frequency" --> "MS" as offset alias:

sm.tsa.seasonal_decompose(df.asfreq('MS'), model = 'additive')

See How to set frequency with pd.to_datetime()? for more details, also about how you would deal with gaps.

If you have data that is highly scattered in time so that you have too many gaps to fill or if gaps in time are nothing important, option 1 of using "period" is probably the better choice.

In my example case of df_test, option 2. is not good. The data is totally scattered in time, and if I take a minute as the frequency, you get this:

Output of df_test.asfreq('s') (=frequency in seconds):

2016-05-04 08:53:20      1
2016-05-04 08:53:21      2
2016-05-04 08:53:22      1
2016-05-04 08:53:23      9
2016-05-04 08:53:24    NaN
                      ...
2016-05-04 08:58:19    NaN
2016-05-04 08:58:20      1
2016-05-04 08:58:21      1
2016-05-04 08:58:22      3
2016-05-04 08:58:23      9
Freq: S, Name: listData, Length: 304, dtype: object

You see here that although my data is only 16 rows, introducing a frequency in seconds forces the df to be 304 rows only to reach out from "08:53:20" till "08:58:23", 288 gaps are caused here. What is more, here you have to hit the exact time. If you have 0.1 or even 0.12314 seconds as your real frequency instead, you will not hit most of the items with your index.

Here an example with min as the offset alias, df_test.asfreq('min'):

2016-05-04 08:53:20      1
2016-05-04 08:54:20    NaN
2016-05-04 08:55:20    NaN
2016-05-04 08:56:20    NaN
2016-05-04 08:57:20    NaN
2016-05-04 08:58:20      1

We see that only the first and the last minute are filled at all, the rest is not hit.

Taking the day as as the offset alias, df_test.asfreq('d'):

2016-05-04 08:53:20    1

We see that you get only the first row as the resulting df, since there is only one day covered. It will give you the first item found, the rest is dropped.

The end of it all:

Putting together all of this, in your case, take option 2., while in my example case of df_test, option 1 is needed.

这篇关于时间序列的分解():ValueError:您必须指定一个周期或x必须是一个带有DatetimeIndex且频率未设置为None的pandas对象的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆