Resampling with custom periods
Question
Is there a 'cookbook' way of resampling a DataFrame with (semi-)irregular periods?
I have a dataset at a daily interval and want to resample it to what is sometimes (in scientific literature) called dekads. I don't think there is a proper English term for it, but it basically means chopping a month into three roughly ten-day parts, where the third part is the remainder of anything between 8 and 11 days.
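To make the definition concrete, here is a minimal sketch that maps a single date to the first day of its dekad. The helper name `dekad_start` is made up for illustration and is not part of pandas.

```python
import datetime


def dekad_start(day: datetime.date) -> datetime.date:
    """Return the first day of the dekad that contains `day`."""
    # Days 1-10 -> index 0, days 11-20 -> index 1, day 21 to month end -> index 2.
    index = min((day.day - 1) // 10, 2)
    return day.replace(day=index * 10 + 1)


print(dekad_start(datetime.date(2013, 1, 31)))  # 2013-01-21
```

The `min(..., 2)` is what folds day 31 into the third dekad instead of starting a fourth one.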
I came up with two solutions myself: a specific one for this case and a more general one for any irregular periods. But neither is really good, so I'm curious how others handle this type of situation.
Let's start by creating some sample data:
import datetime
import numpy as np
import pandas as pd

# daily index from 2013-01-01 through 2013-02-20
begin = datetime.datetime(2013, 1, 1)
end = datetime.datetime(2013, 2, 20)
dtrange = pd.date_range(begin, end)
p1 = np.random.rand(len(dtrange)) + 5
p2 = np.random.rand(len(dtrange)) + 10
df = pd.DataFrame({'p1': p1, 'p2': p2}, index=dtrange)
The first thing I came up with is grouping by individual months (YYYYMM) and then slicing each month manually, like this:
def to_dec1(data, func):
    # create the indexes: the start of each ~10-day period
    idx1 = datetime.datetime(data.index[0].year, data.index[0].month, 1)
    idx2 = idx1 + datetime.timedelta(days=10)
    idx3 = idx2 + datetime.timedelta(days=10)

    # slice each period (label slices are inclusive) and apply the function
    oneday = datetime.timedelta(days=1)
    fir = func(data.loc[:idx2 - oneday].values, axis=0)
    sec = func(data.loc[idx2:idx3 - oneday].values, axis=0)
    thi = func(data.loc[idx3:].values, axis=0)

    return pd.DataFrame([fir, sec, thi], index=[idx1, idx2, idx3], columns=data.columns)

dfmean = df.groupby(lambda x: x.strftime('%Y%m'), group_keys=False).apply(to_dec1, np.mean)
This results in:
print(dfmean)
p1 p2
2013-01-01 5.436778 10.409845
2013-01-11 5.534509 10.482231
2013-01-21 5.449058 10.454777
2013-02-01 5.685700 10.422697
2013-02-11 5.578137 10.532180
2013-02-21 NaN NaN
Note that you always get a full month of dekads in return. That is not a problem, and the empty ones are easy to remove if needed.
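Removing the empty trailing dekad could look like the sketch below. It assumes that a row which is NaN in every column is one to discard; the small `example` frame is a made-up stand-in for `dfmean`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfmean: the last dekad has no data.
idx = pd.to_datetime(["2013-02-01", "2013-02-11", "2013-02-21"])
example = pd.DataFrame(
    {"p1": [5.6, 5.5, np.nan], "p2": [10.4, 10.5, np.nan]},
    index=idx,
)

# Drop only the rows where every column is NaN.
trimmed = example.dropna(how="all")
print(trimmed)
```

Using `how="all"` keeps dekads where only some columns are missing, which may or may not be what you want.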
The other solution works by providing a range of dates at which to chop up the DataFrame, and then applying a function to each segment. It's more flexible in terms of the periods you want.
def to_dec2(data, dts, func):
    chunks = []
    # each segment runs from one date up to the day before the next date
    for n, start in enumerate(dts[:-1]):
        end = dts[n + 1] - datetime.timedelta(days=1)
        chunks.append(func(data.loc[start:end].values, axis=0))

    # the last date in dts only closes the final segment
    return pd.DataFrame(chunks, index=dts[:-1], columns=data.columns)

dfmean2 = to_dec2(df, dfmean.index, np.mean)
Note that I'm using the index of the previous result as the range of dates, to save some time 'building' it myself.
What would be the best way of handling these cases? Is there perhaps a more built-in method in pandas?
Recommended answer
If you use NumPy 1.7 or later, you can use datetime64 and timedelta64 arrays to do the calculation:
Create the sample data:
import datetime
import pandas as pd
import numpy as np

begin = datetime.datetime(2013, 1, 1)
end = datetime.datetime(2013, 2, 20)
dtrange = pd.date_range(begin, end)
p1 = np.random.rand(len(dtrange)) + 5
p2 = np.random.rand(len(dtrange)) + 10
df = pd.DataFrame({'p1': p1, 'p2': p2}, index=dtrange)
Calculate the start date of each dekad:
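# days elapsed since the start of the row's dekad (0 on the 1st, 11th and 21st; 10 on the 31st)
d = df.index.day - np.clip((df.index.day-1) // 10, 0, 2)*10 - 1
# subtract that offset so every row is labelled with its dekad's first day
date = df.index.values - np.array(d, dtype="timedelta64[D]")
df.groupby(date).mean()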
The output is:
p1 p2
2013-01-01 5.413795 10.445640
2013-01-11 5.516063 10.491339
2013-01-21 5.539676 10.528745
2013-02-01 5.783467 10.478001
2013-02-11 5.358787 10.579149
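The same group-by-label idea can be extended to arbitrary irregular periods. The following is a rough sketch, not part of the original answer: it swaps the day-of-month arithmetic for `np.searchsorted` on a sorted list of period start dates. The `starts` values and the toy `p1` column are illustrative assumptions, and the first start date is assumed to be no later than the first row.

```python
import numpy as np
import pandas as pd

dtrange = pd.date_range("2013-01-01", "2013-02-20")
# Toy data (0, 1, 2, ...) so the per-period means are easy to check by hand.
df = pd.DataFrame({"p1": np.arange(len(dtrange), dtype=float)}, index=dtrange)

# Hypothetical period start dates; must be sorted in ascending order.
starts = pd.DatetimeIndex(
    ["2013-01-01", "2013-01-11", "2013-01-21", "2013-02-01", "2013-02-11"]
)

# For each row, find the position of the last start date that is <= the row's date.
pos = np.searchsorted(starts.values, df.index.values, side="right") - 1
labels = starts.values[pos]

result = df.groupby(labels).mean()
print(result)
```

Any sorted set of start dates can be used for `starts`, so this covers the dekad case as well as less regular schemes.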