Resampling with custom periods

Question

Is there a 'cookbook' way of resampling a DataFrame with (semi-)irregular periods?

I have a dataset at a daily interval and want to resample it to what is sometimes (in scientific literature) called dekads. I don't think there is a proper English term for it, but it basically means chopping a month into three roughly ten-day parts, where the third part is the remainder of anywhere between 8 and 11 days.
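
To make the definition concrete, here is a minimal sketch that maps a calendar date to the first day of its dekad. The helper name `dekad_start` is hypothetical (it is not part of pandas or the standard library) and only illustrates the 1/11/21 split described above.

```python
import datetime

def dekad_start(day):
    """Return the first day of the dekad containing `day` (hypothetical helper)."""
    # Days 1-10 map to 1, days 11-20 map to 11, days 21 to month end map to 21.
    first = min((day.day - 1) // 10, 2) * 10 + 1
    return day.replace(day=first)

print(dekad_start(datetime.date(2013, 1, 31)))  # 2013-01-21
print(dekad_start(datetime.date(2013, 2, 15)))  # 2013-02-11
```

The `min(..., 2)` keeps days 29 to 31 in the third dekad rather than starting a fourth one.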

I came up with two solutions myself: a specific one for this case and a more general one for any irregular periods. Neither is really good, so I'm curious how others handle these types of situations.

Let's start by creating some sample data:

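import numpy as np
import pandas as pd

# pd.datetime has been removed from recent pandas releases; use pd.Timestamp
begin = pd.Timestamp(2013, 1, 1)
end = pd.Timestamp(2013, 2, 20)

# one row per day between begin and end, inclusive
dtrange = pd.date_range(begin, end)

p1 = np.random.rand(len(dtrange)) + 5
p2 = np.random.rand(len(dtrange)) + 10

df = pd.DataFrame({'p1': p1, 'p2': p2}, index=dtrange)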

The first thing I came up with is grouping by individual months (YYYYMM) and then slicing each group manually, like this:

def to_dec1(data, func):

    # create the indexes, start of each ~10-day period
    idx1 = pd.Timestamp(data.index[0].year, data.index[0].month, 1)
    idx2 = idx1 + pd.Timedelta(days=10)
    idx3 = idx2 + pd.Timedelta(days=10)

    # slice each period (label slices are inclusive) and apply the function;
    # .loc replaces the removed .ix indexer
    oneday = pd.Timedelta(days=1)
    fir = func(data.loc[:idx2 - oneday].values, axis=0)
    sec = func(data.loc[idx2:idx3 - oneday].values, axis=0)
    thi = func(data.loc[idx3:].values, axis=0)

    return pd.DataFrame([fir, sec, thi], index=[idx1, idx2, idx3], columns=data.columns)

dfmean = df.groupby(lambda x: x.strftime('%Y%m'), group_keys=False).apply(to_dec1, np.mean)

This results in:

print(dfmean)

                  p1         p2
2013-01-01  5.436778  10.409845
2013-01-11  5.534509  10.482231
2013-01-21  5.449058  10.454777
2013-02-01  5.685700  10.422697
2013-02-11  5.578137  10.532180
2013-02-21       NaN        NaN

Note that you always get a full month of dekads in return. That's not a problem, and the extra rows are easy to remove if needed.
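
One possible way to drop such a trailing empty dekad is `DataFrame.dropna(how="all")`, which removes only rows where every column is NaN. The small frame below is an illustrative stand-in for the February rows of `dfmean`, with values rounded from the output above, not the real result.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the February rows of dfmean (values rounded).
idx = pd.to_datetime(["2013-02-01", "2013-02-11", "2013-02-21"])
dfmean_feb = pd.DataFrame(
    {"p1": [5.69, 5.58, np.nan], "p2": [10.42, 10.53, np.nan]},
    index=idx,
)

# Drop rows where all columns are NaN, i.e. dekads without any data.
trimmed = dfmean_feb.dropna(how="all")
print(trimmed)
```

Using `how="all"` rather than the default keeps dekads in which only some columns are missing.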

The other solution works by providing a range of dates at which to chop up the DataFrame, then performing a function on each segment. It's more flexible in terms of the periods you want.

def to_dec2(data, dts, func):

    chunks = []
    for n, start in enumerate(dts[:-1]):

        # each segment runs up to the day before the next start date
        end = dts[n+1] - pd.Timedelta(days=1)
        # .loc replaces the removed .ix indexer
        chunks.append(func(data.loc[start:end].values, axis=0))

    return pd.DataFrame(chunks, index=dts[:-1], columns=data.columns)

dfmean2 = to_dec2(df, dfmean.index, np.mean)

Note that I'm using the index of the previous result as the range of dates, to save some time 'building' it myself.
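
If the range of dekad start dates has to be built directly, one possible sketch is to generate daily dates and keep only days 1, 11 and 21. The helper name `dekad_starts` is hypothetical and not a pandas function.

```python
import numpy as np
import pandas as pd

def dekad_starts(begin, end):
    """Dekad start dates (days 1, 11, 21) from begin to end, inclusive (hypothetical helper)."""
    days = pd.date_range(begin, end, freq="D")
    # Keep only the dates that open a dekad.
    return days[np.isin(days.day, [1, 11, 21])]

dts = dekad_starts("2013-01-01", "2013-03-01")
print(list(dts.strftime("%Y-%m-%d")))
```

Because `to_dec2` treats the last date only as a closing boundary, the range here is extended to the next dekad start (2013-03-01) so that the final period is not lost.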

What would be the best way of handling these cases? Is there perhaps a more built-in method in pandas?

Recommended answer

If you use NumPy 1.7 or later, you can use datetime64 and timedelta64 arrays to do the calculation:

Create the sample data:

import pandas as pd
import numpy as np

# pd.datetime has been removed from recent pandas releases; use pd.Timestamp
begin = pd.Timestamp(2013, 1, 1)
end = pd.Timestamp(2013, 2, 20)

dtrange = pd.date_range(begin, end)

p1 = np.random.rand(len(dtrange)) + 5
p2 = np.random.rand(len(dtrange)) + 10

df = pd.DataFrame({'p1': p1, 'p2': p2}, index=dtrange)

Calculate the dekad start date for each row:

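# number of days between each date and the start of its dekad (day 1, 11 or 21)
day = np.asarray(df.index.day)
d = day - np.clip((day - 1) // 10, 0, 2) * 10 - 1
# subtract that offset so every date maps to the first day of its dekad
date = df.index.values - np.array(d, dtype="timedelta64[D]")
df.groupby(date).mean()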

The output is:

                 p1         p2
2013-01-01  5.413795  10.445640
2013-01-11  5.516063  10.491339
2013-01-21  5.539676  10.528745
2013-02-01  5.783467  10.478001
2013-02-11  5.358787  10.579149
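
As a quick check of the offset arithmetic used in this answer, the sketch below applies the same expression to a handful of hand-picked day-of-month values (they are illustrative and not taken from the sample data).

```python
import numpy as np

# Hand-picked days of the month, including the edges of each dekad.
days = np.array([1, 10, 11, 20, 21, 28, 31])

# Same expression as in the answer: days elapsed since the dekad started.
offset = days - np.clip((days - 1) // 10, 0, 2) * 10 - 1

# Subtracting the offset gives the first day of each dekad.
dekad_first_day = days - offset
print(dekad_first_day.tolist())  # [1, 1, 11, 11, 21, 21, 21]
```

The `np.clip(..., 0, 2)` is what folds day 31 into the third dekad: without it, `(31 - 1) // 10` would give 3 and start a fourth period.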
