Python: look back n days rolling standard deviation


Question

I have a question about computing a rolling standard deviation.

The data frame looks like this:

2010-01-20 05:00:00   -0.011
2010-01-20 05:02:00   -0.032
2010-01-20 05:02:00   -0.037
2010-01-20 05:04:00    0.001
2010-01-20 05:06:00    0.023
2010-01-20 05:06:00    0.011
2010-01-20 05:08:00    0.049
2010-01-20 05:10:00    0.102
....
2010-05-20 17:00:00    0.022

This is 2-minute data from 5am to 5pm (the index is a timestamp in the format 'yyyy-mm-dd hh:mm:ss').

I want to calculate the standard deviation with an 8-day look-back. My intuition is to split the data frame into daily data sets and then calculate the rolling standard deviation, but I don't know how to deal with the index, and I guess my method may take a long time to compute. Thanks a lot for your help!

Finally, I would like a result like this:

2010-01-20   0.0
2010-01-21   0.0
2010-01-22   0.0
....
2010-01-26   0.0
2010-01-27   0.12
2010-01-28   0.02
2010-01-29   0.07
...
2010-05-20   0.10

Thank you for your help, @unutbu.

I just found a problem in the data: the data frame does not contain every 2-minute timestamp. For example:

2010-01-21 15:08:00    0.044
2010-01-22 05:10:00    0.102

The data ends at 15:08 on 2010-01-21 and starts again at 05:10:00 on 2010-01-22, so a constant window size may not solve this problem. Any suggestions? Thanks a lot.

Recommended answer

If the time series has a constant frequency:

You could compute the number of 2-minute intervals in 8 days:

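window_size = pd.Timedelta('8D') // pd.Timedelta('2min')

and then compute the rolling standard deviation with window=window_size. The original answer used pd.rolling_std, which has since been removed from pandas; Series.rolling(...).std() is its replacement:

import pandas as pd
import numpy as np
np.random.seed(1)

# example data: one random value every 2 minutes
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2min')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)
# the number of 2-minute intervals in 8 days (integer division gives an int)
window_size = pd.Timedelta('8D') // pd.Timedelta('2min')    # 5760

# rolling standard deviation over the previous 5760 rows (8 days)
df['std'] = df['val'].rolling(window=window_size).std()
print(df.tail())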

yields

                          val       std
2010-05-20 16:52:00  0.768918  0.291137
2010-05-20 16:54:00  0.486348  0.291098
2010-05-20 16:56:00  0.679610  0.291099
2010-05-20 16:58:00  0.951798  0.291114
2010-05-20 17:00:00  0.059935  0.291109

To resample this time series so as to get one value per day, you could use the resample method and aggregate the values by taking the mean:

df['std'].resample('D').mean()

yields

...
2010-05-16    0.289019
2010-05-17    0.289988
2010-05-18    0.289713
2010-05-19    0.289269
2010-05-20    0.288890
Freq: D, Name: std, Length: 121


Above, we computed the rolling standard deviation and then resampled to a time series with daily frequency.

If we were to resample the original data to daily frequency first and then compute the rolling standard deviation, the result would in general be different.

Note also that your data looks like it has quite a bit of variation within each day, so resampling by taking the mean might (wrongly?) hide that variation. So it is probably better to compute the std first.
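To make the difference between the two orderings concrete, here is a minimal sketch on synthetic data. The series `s`, the 20-day span, and the standard-normal values are assumptions chosen only for illustration; they are not the asker's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumed example data: 20 full days of 2-minute standard-normal noise
index = pd.date_range("2010-01-20", periods=20 * 720, freq="2min")
s = pd.Series(rng.normal(size=len(index)), index=index)

window = pd.Timedelta("8D") // pd.Timedelta("2min")  # 5760 rows

# Ordering 1: rolling std on the 2-minute data, then one mean value per day
std_then_daily = s.rolling(window=window).std().resample("D").mean()

# Ordering 2: daily means first, then an 8-row rolling std of those means
daily_then_std = s.resample("D").mean().rolling(window=8).std()

print(std_then_daily.dropna().tail(3))
print(daily_then_std.dropna().tail(3))
```

With this kind of data the first ordering stays close to the spread of the individual observations, while the second measures only how much the daily averages move, which is far smaller because averaging removes the within-day variation.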

If the time series does not have a constant frequency:

If you have enough memory, I think the easiest way to deal with this situation is to use asfreq to expand the time series to one that has a constant frequency.

import pandas as pd
import numpy as np
np.random.seed(1)

# make an example df with roughly half of the rows randomly removed
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2min')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)
mask = np.random.randint(2, size=N).astype(bool)
df = df.loc[mask]

# expand the time series, filling in missing values with NaN
df = df.asfreq('2min', method=None)

# now we can use the constant-frequency solution
window_size = pd.Timedelta('8D') // pd.Timedelta('2min')    # 5760
# min_periods=1 lets windows that contain NaN rows still produce a value
df['std'] = df['val'].rolling(window=window_size, min_periods=1).std()

result = df['std'].resample('D').mean()
print(result.head())

yields

2010-01-20    0.301834
2010-01-21    0.292505
2010-01-22    0.293897
2010-01-23    0.291018
2010-01-24    0.290444
Freq: D, Name: std, dtype: float64
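To see what the asfreq step does on a small scale, the following sketch uses a made-up three-row series (the timestamps and values are hypothetical). It shows the inserted NaN rows and how min_periods controls whether a window with missing rows still yields a result.

```python
import pandas as pd

# Hypothetical irregular series: the 15:10 and 15:12 rows are missing
s = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(
        ["2010-01-21 15:06", "2010-01-21 15:08", "2010-01-21 15:14"]
    ),
)

# asfreq inserts the missing 2-minute timestamps and fills them with NaN
expanded = s.asfreq("2min")
print(expanded)

# A 3-row window; at least 2 non-NaN values are required for a result
rolled = expanded.rolling(window=3, min_periods=2).std()
print(rolled)
```

Windows that contain fewer non-NaN values than min_periods produce NaN, so the choice of min_periods decides how sparse a window may be before its standard deviation is discarded.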

The alternative to expanding the time series is to write code to compute the correct sub-Series for each 8-day window. While this is possible, the fact that you would have to compute this for each row of the time series could make this method very slow. Thus, I think the faster approach is to expand the time series.
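As a further option that is not part of the original answer: pandas versions that provide Series.rolling also accept a time offset such as '8D' as the window when the index is a sorted DatetimeIndex, which avoids both the expansion and a hand-written loop. The sketch below assumes synthetic data restricted to 05:00 through 16:58 with random gaps; the column names and the choice of taking the last value per day are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Assumed irregular data: 2-minute stamps between 05:00 and 16:58, with gaps
full = pd.date_range("2010-01-20 05:00", "2010-02-20 17:00", freq="2min")
full = full[(full.hour >= 5) & (full.hour < 17)]
keep = rng.random(len(full)) < 0.5
df = pd.DataFrame({"val": rng.random(int(keep.sum()))}, index=full[keep])

# Time-based window: each row looks back over the preceding 8 days of
# timestamps, however many rows that happens to contain
df["std"] = df["val"].rolling("8D").std()

# One value per day: here, the last rolling std observed on that day
daily = df["std"].resample("D").last()
print(daily.tail())
```

Taking `.last()` gives the look-back value as of the end of each trading day; using `.mean()` instead would mirror the daily aggregation used in the answer above.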
