Resample/Upsample Period Index and using both extreme time "edges" of the data


Question

I have the following DataFrame, a weekly price time series with a PeriodIndex. Let's call it df:

                           timestamp        open        high         low       close        volume
timestamp
2009-02-01/2009-02-07  733442.166309  830.540773  832.586910  828.788627  830.706009  48401.952790
2009-02-08/2009-02-14  733449.166309  839.945279  841.763948  837.812232  839.742489  53429.330472
2009-02-15/2009-02-21  733456.245777  790.733108  792.399775  788.897523  790.549550  50671.887387
2009-02-22/2009-02-28  733463.166309  760.586910  762.640558  758.234979  760.428112  60565.506438

If I try to resample it with df.resample('30min').mean(), the data ends at 2009-02-22. I would like it to end at 2009-02-28 while still starting at 2009-02-01. How can I do that?
I suspect it has to do with the closed and label arguments of the resample function, but those are not explained very well in the documentation.

Here is a snippet to reconstruct the DataFrame:

import pandas as pd
from pandas import Period
dikt={'volume': {Period('2009-02-01/2009-02-07', 'W-SAT'): 48401.952789699571, Period('2009-02-08/2009-02-14', 'W-SAT'): 53429.330472103007, Period('2009-02-15/2009-02-21', 'W-SAT'): 50671.887387387389, Period('2009-02-22/2009-02-28', 'W-SAT'): 60565.506437768243}, 'close': {Period('2009-02-01/2009-02-07', 'W-SAT'): 830.70600858369096, Period('2009-02-08/2009-02-14', 'W-SAT'): 839.74248927038627, Period('2009-02-15/2009-02-21', 'W-SAT'): 790.54954954954951, Period('2009-02-22/2009-02-28', 'W-SAT'): 760.42811158798281}, 'open': {Period('2009-02-01/2009-02-07', 'W-SAT'): 830.54077253218884, Period('2009-02-08/2009-02-14', 'W-SAT'): 839.94527896995703, Period('2009-02-15/2009-02-21', 'W-SAT'): 790.73310810810813, Period('2009-02-22/2009-02-28', 'W-SAT'): 760.58690987124464}, 'high': {Period('2009-02-01/2009-02-07', 'W-SAT'): 832.58690987124464, Period('2009-02-08/2009-02-14', 'W-SAT'): 841.76394849785413, Period('2009-02-15/2009-02-21', 'W-SAT'): 792.39977477477476, Period('2009-02-22/2009-02-28', 'W-SAT'): 762.64055793991417}, 'low': {Period('2009-02-01/2009-02-07', 'W-SAT'): 828.78862660944208, Period('2009-02-08/2009-02-14', 'W-SAT'): 837.8122317596567, Period('2009-02-15/2009-02-21', 'W-SAT'): 788.89752252252254, Period('2009-02-22/2009-02-28', 'W-SAT'): 758.23497854077254}, 'timestamp': {Period('2009-02-01/2009-02-07', 'W-SAT'): 733442.16630901292, Period('2009-02-08/2009-02-14', 'W-SAT'): 733449.16630901292, Period('2009-02-15/2009-02-21', 'W-SAT'): 733456.24577702698, Period('2009-02-22/2009-02-28', 'W-SAT'): 733463.16630901292}}
# Assign the result so that the rest of the post can refer to it as df
df = pd.DataFrame(dikt, columns=['timestamp', 'open', 'high', 'low', 'close', 'volume'])

Recommended answer

Since you want to include both the start_time of the first period in the PeriodIndex and the end_time of the last one, the keyword arguments of DataFrame.resample are of little help here: they act on the resampling as a whole, so changing any one of them affects either the start_time or the end_time, but not both.

Instead, you could first convert the data to daily frequency, "D" (an upsampling step, since the data is weekly), and then aggregate with the mean over 30-minute bins:

df.resample('D').asfreq().resample('30min').mean()
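
The two "edges" discussed here are the start_time of the first period and the end_time of the last period. As a minimal sketch for inspecting them, the snippet below rebuilds only the index with pd.period_range; the name idx is a stand-in for df.index and is an assumption of this example, not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for df.index: four weekly periods ending on Saturday.
idx = pd.period_range(start="2009-02-01", periods=4, freq="W-SAT")

first_edge = idx[0].start_time   # first instant of the first week
last_edge = idx[-1].end_time     # last instant of the last week

print(first_edge)
print(last_edge.normalize())     # calendar day on which the last week ends
```

Any approach that should cover the full data has to span the interval between these two timestamps.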

The convention argument would be the one to use if you wanted to resample from the start_time or from the end_time specifically.

To check:

resamp_start = df.resample('30min').mean()
resamp_all = df.resample('D').asfreq().resample('30min').mean().head(resamp_start.shape[0])
resamp_start.equals(resamp_all)
True
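
The start/end choice that convention refers to can be illustrated outside of resample with PeriodIndex.to_timestamp and its how parameter. This is only a sketch of the same idea, not the resample code path itself, and idx is again a hypothetical stand-in for df.index.

```python
import pandas as pd

# Hypothetical stand-in for df.index.
idx = pd.period_range(start="2009-02-01", periods=4, freq="W-SAT")

# Anchor every weekly period at its first instant ...
starts = idx.to_timestamp(how="start")
# ... or at its last instant.
ends = idx.to_timestamp(how="end")

print(starts[-1])              # start of the last week
print(ends[-1].normalize())    # day on which the last week ends
```

Anchoring at the start places the last period at 2009-02-22, which is consistent with the resampled data in the question ending on that date.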


If you need only the resampled index and not an aggregation, it makes sense to upsample from the current frequency to the base unit of the target frequency (here, 1 minute) and then take every 30th row to get one sample per 30 minutes:

df.resample('min').asfreq().iloc[::30]

This gives you samples for the whole of 2009-02-28, whereas the earlier approach only covers dates up to, but not including, 2009-02-28, because the .resample('D') step normalizes the timestamps (times are set to midnight).
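
As an alternative sketch that does not call resample on the PeriodIndex at all, one can build the 30-minute grid explicitly between the two edges and look up the weekly row for each grid point. This is not the method from the answer above. The frame weekly, its single close column (values rounded from the question's data) and the names grid and upsampled are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for df: one column, four W-SAT weeks (rounded values).
idx = pd.period_range(start="2009-02-01", periods=4, freq="W-SAT")
weekly = pd.DataFrame({"close": [830.71, 839.74, 790.55, 760.43]}, index=idx)

# 30-minute grid from the start of the first week to the end of the last week.
grid = pd.date_range(start=idx[0].start_time, end=idx[-1].end_time, freq="30min")

# For every grid point, look up the weekly row whose period contains it,
# then label the rows with the grid timestamps.
upsampled = weekly.reindex(grid.to_period("W-SAT"))
upsampled.index = grid

print(upsampled.index[0], upsampled.index[-1], len(upsampled))
```

Each 30-minute row simply repeats the value of the week it falls in, so the result starts at 2009-02-01 00:00 and runs through the last 30-minute slot of 2009-02-28.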
