Resampling a pandas dataframe with multi-index containing timeseries


Question

Apologies for creating what appears to be a duplicate of this question. I have a dataframe that is shaped more or less like the one below:

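import numpy as np
import pandas as pd

df_length = 240
df = pd.DataFrame(np.random.randn(df_length, 2), columns=['a', 'b'])
# one row per hour, starting on 2017-06-23
df['datetime'] = pd.date_range('2017-06-23', periods=df_length, freq='h')

unique_jobs = ['job1', 'job2', 'job3']
# repeat the job list, flatten it, then sort so each job owns a contiguous block of rows
job_id = [unique_jobs for _ in range(df_length // len(unique_jobs))]
df['job_id'] = sorted(val for sublist in job_id for val in sublist)

# append=True keeps the default integer index as level 0
df.set_index(['job_id', 'datetime'], append=True, inplace=True)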

print(df[:5]) returns:

                                     a         b
  job_id datetime                               
0 job1   2017-06-23 00:00:00 -0.067011 -0.516382
1 job1   2017-06-23 01:00:00 -0.174199  0.068693
2 job1   2017-06-23 02:00:00 -1.227568 -0.103878
3 job1   2017-06-23 03:00:00 -0.847565 -0.345161
4 job1   2017-06-23 04:00:00  0.028852  3.111738

I need to resample df['a'] to derive a daily rolling mean, i.e. apply .resample('D').mean().rolling(window=2).mean().

I have tried two approaches:

1 - as suggested here

df.unstack('job_id','datetime').resample('D').mean().rolling(window=2).mean().stack('job_id', 'datetime')

This returns an error.

2 - as suggested here

level_values = df.index.get_level_values
result = df.groupby( [ level_values(i) for i in [0,1] ] + [ pd.Grouper(freq='D', level=2) ] ).mean().rolling(window=2).mean()

This does not return an error, but it does not seem to resample/group the df appropriately. The result seems to contain hourly data points rather than daily ones:

print(result[:5])
                            a         b
  job_id datetime                      
0 job1   2017-06-23       NaN       NaN
1 job1   2017-06-23  0.831609  1.348970
2 job1   2017-06-23 -0.560047  1.063316
3 job1   2017-06-23 -0.641936 -0.199189
4 job1   2017-06-23  0.254402 -0.328190
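
A likely reason for the hourly-looking output is that the grouping keys in attempt 2 include index level 0, the original integer index, which is unique for every row. Each group then holds a single row, so nothing is aggregated. The sketch below rebuilds a frame like the one above (the name n_rows and the use of np.repeat are assumptions made for brevity) and compares the number of rows produced with and without that level:

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (random data).
n_rows = 240
df = pd.DataFrame(np.random.randn(n_rows, 2), columns=["a", "b"])
df["datetime"] = pd.date_range("2017-06-23", periods=n_rows, freq="h")
df["job_id"] = np.repeat(["job1", "job2", "job3"], n_rows // 3)
df = df.set_index(["job_id", "datetime"], append=True)

level_values = df.index.get_level_values

# Keys as in attempt 2: level 0 is unique per row, so every group has one row.
per_row = df.groupby(
    [level_values(0), level_values(1), pd.Grouper(freq="D", level=2)]
).mean()

# Keys without level 0: one group per (job_id, calendar day) pair.
per_day = df.groupby(
    [level_values(1), pd.Grouper(freq="D", level=2)]
).mean()

print(len(per_row), len(per_day))
```

With level 0 left out, the grouped mean has one row per job and day instead of one row per original hourly record.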

Recommended answer

First, let's define a resampler function:

def resampler(x):    
    return x.set_index('datetime').resample('D').mean().rolling(window=2).mean()

Then we group by job_id and apply the resampler function:

 df.reset_index(level=2).groupby(level=1).apply(resampler)

Out[657]: 
                          a         b
job_id datetime                      
job1   2017-06-23       NaN       NaN
       2017-06-24  0.053378  0.004727
       2017-06-25  0.265074  0.234081
       2017-06-26  0.192286  0.138148
job2   2017-06-26       NaN       NaN
       2017-06-27 -0.016629 -0.041284
       2017-06-28 -0.028662  0.055399
       2017-06-29  0.113299 -0.204670
job3   2017-06-29       NaN       NaN
       2017-06-30  0.233524 -0.194982
       2017-07-01  0.068839 -0.237573
       2017-07-02 -0.051211 -0.069917
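
As an alternative to apply, the same two steps can be sketched with pd.Grouper: take the daily mean per job first, then compute the 2-day rolling mean inside each job. This is a minimal sketch, not the method from the answer above; the names n_rows, daily and rolled, the fixed seed and the use of transform are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (random data, fixed seed).
n_rows = 240
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((n_rows, 2)), columns=["a", "b"])
df["datetime"] = pd.date_range("2017-06-23", periods=n_rows, freq="h")
df["job_id"] = np.repeat(["job1", "job2", "job3"], n_rows // 3)
df = df.set_index(["job_id", "datetime"], append=True)

# Step 1: daily mean per job. Only job_id and the calendar day are used as
# keys; the unnamed integer level is left out because it is unique per row.
daily = df.groupby(
    [pd.Grouper(level="job_id"), pd.Grouper(level="datetime", freq="D")]
).mean()

# Step 2: 2-day rolling mean computed separately within each job, so the
# window never mixes the last day of one job with the first day of the next.
rolled = daily.groupby(level="job_id").transform(
    lambda s: s.rolling(window=2).mean()
)

print(rolled)
```

The result keeps the (job_id, datetime) index of the daily means, and the first day of each job is NaN because a 2-day window has no earlier day to average with.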

Let me know if this is what you are after.
