Time Series Analysis - unevenly spaced measures - pandas + statsmodels

Problem description

I have two numpy arrays light_points and time_points and would like to use some time series analysis methods on those data.

I then tried this:

import statsmodels.api as sm
import pandas as pd
tdf = pd.DataFrame({'time':time_points[:]})
rdf =  pd.DataFrame({'light':light_points[:]})
rdf.index = pd.DatetimeIndex(freq='w',start=0,periods=len(rdf.light))
#rdf.index = pd.DatetimeIndex(tdf['time'])

This runs, but it is not doing the correct thing: the measurements are not evenly spaced in time. If I instead use time_points as the index of my DataFrame, I get an error:

rdf.index = pd.DatetimeIndex(tdf['time'])

decomp = sm.tsa.seasonal_decompose(rdf)

elif freq is None:
    raise ValueError("You must specify a freq or x must be a pandas object with a timeseries index")

ValueError: You must specify a freq or x must be a pandas object with a timeseries index

I don't know how to correct this. Also, it seems that pandas' TimeSeries is deprecated.

I tried this:

rdf = pd.Series({'light':light_points[:]})
rdf.index = pd.DatetimeIndex(tdf['time'])

But it gives me a length mismatch:

ValueError: Length mismatch: Expected axis has 1 elements, new values have 122 elements

However, I don't understand where this comes from, as rdf['light'] and tdf['time'] have the same length...

Eventually, I tried defining rdf as a pandas Series:

rdf = pd.Series(light_points[:],index=pd.DatetimeIndex(time_points[:]))

And I get this:

ValueError: You must specify a freq or x must be a pandas object with a timeseries index

Then I tried replacing the index with

 pd.TimeSeries(time_points[:])

And that gives me an error on the seasonal_decompose line:

AttributeError: 'Float64Index' object has no attribute 'inferred_freq'

How can I work with unevenly spaced data? I was thinking about creating an approximately evenly spaced time array by adding many unknown values between the existing ones and using interpolation to "evaluate" those points, but I think there could be a cleaner and easier solution.

Solution

seasonal_decompose() requires a freq that is either provided as part of the DatetimeIndex meta information, inferred via pandas.Index.inferred_freq, or supplied by the user as an integer giving the number of periods per cycle, e.g. 12 for monthly data (from the docstring for seasonal_mean):

def seasonal_decompose(x, model="additive", filt=None, freq=None):
    """
    Parameters
    ----------
    x : array-like
        Time series
    model : str {"additive", "multiplicative"}
        Type of seasonal component. Abbreviations are accepted.
    filt : array-like
        The filter coefficients for filtering out the seasonal component.
        The default is a symmetric moving average.
    freq : int, optional
        Frequency of the series. Must be used if x is not a pandas
        object with a timeseries index.
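
As a small aside, the two pandas-side sources of the frequency mentioned above can be inspected directly on an index. The following is a minimal sketch with a made-up ten-week index (the variable names regular and irregular are illustrative, not from the question); it only shows that a gap-free index carries a frequency while an index with missing weeks does not.

```python
import pandas as pd

# A regular weekly index carries its frequency as metadata.
regular = pd.date_range(start="2015-01-04", periods=10, freq="W")
print(regular.freq)           # the weekly offset attached to the index
print(regular.inferred_freq)  # frequency string inferred from the spacing

# Dropping a few timestamps leaves gaps, so no frequency is stored
# and none can be inferred from the remaining, uneven spacing.
irregular = regular[[0, 1, 3, 4, 7, 9]]
print(irregular.freq)
print(irregular.inferred_freq)
```

This is the situation in which seasonal_decompose() has nothing to fall back on and asks for the integer to be passed explicitly.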

To illustrate - using random sample data:

from datetime import datetime
import numpy as np
import pandas as pd

length = 400
x = np.sin(np.arange(length)) * 10 + np.random.randn(length)
df = pd.DataFrame(data=x, index=pd.date_range(start=datetime(2015, 1, 1), periods=length, freq='W'), columns=['value'])

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 400 entries, 2015-01-04 to 2022-08-28
Freq: W-SUN

decomp = sm.tsa.seasonal_decompose(df)
data = pd.concat([df, decomp.trend, decomp.seasonal, decomp.resid], axis=1)
data.columns = ['series', 'trend', 'seasonal', 'resid']

Data columns (total 4 columns):
series      400 non-null float64
trend       348 non-null float64
seasonal    400 non-null float64
resid       348 non-null float64
dtypes: float64(4)
memory usage: 15.6 KB

So far, so good - now randomly drop elements from the DatetimeIndex to create unevenly spaced data:

df = df.iloc[np.unique(np.random.randint(low=0, high=length, size=int(length * .8)))]

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 222 entries, 2015-01-11 to 2022-08-21
Data columns (total 1 columns):
value    222 non-null float64
dtypes: float64(1)
memory usage: 3.5 KB

df.index.freq

None

df.index.inferred_freq

None

Running seasonal_decompose on this data 'works':

decomp = sm.tsa.seasonal_decompose(df, freq=52)

data = pd.concat([df, decomp.trend, decomp.seasonal, decomp.resid], axis=1)
data.columns = ['series', 'trend', 'seasonal', 'resid']

DatetimeIndex: 224 entries, 2015-01-04 to 2022-08-07
Data columns (total 4 columns):
series      224 non-null float64
trend       172 non-null float64
seasonal    224 non-null float64
resid       172 non-null float64
dtypes: float64(4)
memory usage: 8.8 KB
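
For comparison, here is a minimal sketch of the interpolation idea raised in the question: put the irregular observations back on a regular weekly grid with resample() and fill the gaps with interpolate(), so that the index regains a frequency. The synthetic data and the names series, uneven and regular are illustrative assumptions, and whether linear-in-time interpolation is appropriate depends entirely on the real measurements.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
length = 400

# Synthetic weekly series (an illustrative stand-in for the real measurements).
full_index = pd.date_range(start="2015-01-04", periods=length, freq="W")
series = pd.Series(np.sin(np.arange(length)) * 10 + rng.standard_normal(length),
                   index=full_index)

# Keep roughly 80% of the rows, always including the first and last one,
# to mimic unevenly spaced observations.
inner = rng.choice(np.arange(1, length - 1), size=int(length * 0.8) - 2, replace=False)
keep = np.concatenate(([0], np.sort(inner), [length - 1]))
uneven = series.iloc[keep]
print(uneven.index.inferred_freq)  # no frequency can be inferred

# Back onto a regular weekly grid: empty weeks become NaN, then
# time-weighted linear interpolation fills them in.
regular = uneven.resample("W").mean().interpolate(method="time")
print(regular.index.freq)
print(len(regular), int(regular.isna().sum()))
```

A series regularized this way can be passed to seasonal_decompose() without an explicit frequency. Note that in recent statsmodels releases the integer argument is named period rather than freq. The interpolated points are invented values, so they smooth the series and can bias the estimated seasonal pattern.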

The question is how useful the result is. Even without gaps in the data that complicate inference of seasonal patterns (see the example use of .interpolate() in the release notes), statsmodels qualifies this procedure as follows:

Notes
-----
This is a naive decomposition. More sophisticated methods should
be preferred.

The additive model is Y[t] = T[t] + S[t] + e[t]

The multiplicative model is Y[t] = T[t] * S[t] * e[t]

The seasonal component is first removed by applying a convolution
filter to the data. The average of this smoothed series for each
period is the returned seasonal component.
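
To make the procedure described in these notes more concrete, here is a rough sketch of a naive additive decomposition written with pandas only. It is not the statsmodels implementation: the function name naive_additive_decompose is hypothetical, the trend is a plain centered rolling mean (which may differ from the filter statsmodels applies), and the input is assumed to be evenly spaced with no gaps.

```python
import numpy as np
import pandas as pd

def naive_additive_decompose(y, period):
    """Hypothetical helper: split y into trend + seasonal + resid."""
    # Trend: centered moving average over one full period.
    trend = y.rolling(window=period, center=True).mean()
    # Remove the trend, then average the detrended values at each
    # position within the cycle to get the seasonal pattern.
    detrended = y - trend
    position = np.arange(len(y)) % period
    pattern = detrended.groupby(position).mean()
    pattern = pattern - pattern.mean()  # centre the pattern on zero
    seasonal = pd.Series(pattern.to_numpy()[position], index=y.index)
    resid = y - trend - seasonal
    return trend, seasonal, resid

# Toy series: linear trend plus a 12-step cycle.
t = np.arange(120)
y = pd.Series(0.1 * t + 5 * np.sin(2 * np.pi * t / 12))
trend, seasonal, resid = naive_additive_decompose(y, period=12)
print(seasonal.iloc[:12].round(2).tolist())
```

The position-within-cycle step is where uneven spacing hurts: if rows are missing, counting observations no longer corresponds to a fixed point in the seasonal cycle, which is one reason the output on gappy data should be treated with caution.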
