Pandas efficient groupby season for every year


Problem description

I have a multi-year time series and want the bounds between which 95% of my data lie. I want to look at this by season of the year ('DJF', 'MAM', 'JJA', 'SON').

I tried the following:

import pandas as pd
import numpy as np
FRAC_2_TAIL = 0.025
yr_idx = pd.date_range(start='2005-01-30', 
                       end='2008-02-02', freq='D')
data = np.random.rand(len(yr_idx))
df = pd.DataFrame(index=yr_idx, data=data, columns=['a'])
month_num_to_season =   { 1:'DJF',  2:'DJF', 
                          3:'MAM',  4:'MAM',  5:'MAM', 
                          6:'JJA',  7:'JJA',  8:'JJA',
                          9:'SON', 10:'SON', 11:'SON',
                         12:'DJF'}
grouped =  df.groupby(lambda x: month_num_to_season.get(x.month))                      
low_bounds = grouped.quantile(FRAC_2_TAIL)
high_bounds = grouped.quantile(1 - FRAC_2_TAIL) 

It works in the sense of giving:

DJF   0.021284
JJA   0.024769
MAM   0.030149
SON   0.041784

but it takes a very long time on my minutely-frequency, decade-long data sets.

I can make use of a TimeGrouper to get almost what I want:

gp_time = df.groupby(pd.TimeGrouper('QS-DEC'))
low_bounds = gp_time.agg(lambda x: x.quantile(FRAC_2_TAIL)) 

but we have separate output for each year (with no obvious way to combine quantile limits over the years).

2004-12-01  0.036755
2005-03-01  0.034271
         ...
2007-09-01  0.098833
2007-12-01  0.068948
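
One can of course relabel the quarterly rows by season, but per-quarter quantiles cannot simply be averaged or pooled to recover the quantile of all the underlying observations in that season, so this output is still not what is wanted. A small sketch of the relabelling, reusing gp_time, FRAC_2_TAIL and month_num_to_season from above, just for illustration:

# Illustration only: tag each quarter-start timestamp with its season.
# These remain per-quarter quantiles; averaging them does not give the
# 2.5% quantile of the pooled observations for that season.
per_quarter = gp_time.agg(lambda x: x.quantile(FRAC_2_TAIL))
per_quarter['season'] = [month_num_to_season.get(ts.month) for ts in per_quarter.index]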

I've also tried making a freq='QS-DEC' time-series 'DJF', 'MAM' etc. to minimize the dictionary lookups, then upsampling to df.index.freq and grouping on that. It is slow and memory-heavy too.

It seems like I'm missing something obvious.

In light of @JohnE's comment:

It is the dict lookup in the groupby that is taking time. Using 5 years of minutely data:

%%timeit
grouped =  df.groupby(lambda x: month_num_to_season.get(x.month)) 
> 13.3 s per loop

The quantile calculation itself is fast:

%%timeit
low_bounds = grouped.quantile(FRAC_2_TAIL)
> 2.94 ms per loop

Adding a season column and grouping on that is similar in overall timing, again dominated by the dict lookup:

SEAS = 'season'
%%timeit
df[SEAS] = [month_num_to_season.get(t_stamp.month) for t_stamp in df.index]
> 13.1 s per loop

%%timeit
gp_on_col = df.groupby(SEAS)
> 10000 loops, best of 3: 62.7 µs per loop

%%timeit
gp_on_col.quantile(FRAC_2_TAIL)
> 753 ms per loop

I re-implemented the method of making a quarterly season dataframe to minimize the dict lookups, then up-sampling it. This method now looks like a substantial improvement; I do not know how I had made it so slow before:

SEASON_HALO = pd.datetools.relativedelta(months=4)
start_with_halo = df.index.min() - SEASON_HALO
end_with_halo = df.index.max() + SEASON_HALO
> 84.1 µs per loop

seasonal_idx = pd.DatetimeIndex(start=start_with_halo, end=end_with_halo, freq='QS-DEC')
seasonal_ts = pd.DataFrame(index=seasonal_idx)
> 440 µs per loop

seasonal_ts[SEAS] = [month_num_to_season.get(t_stamp.month) for t_stamp in seasonal_ts.index]
> 1.25 s per loop

seasonal_minutely_ts = seasonal_ts.resample(df.index.freq, fill_method='ffill')
> 5.12 ms per loop

df_via_resample = df.join(seasonal_minutely_ts)
> 47 ms per loop

gp_up_sample = df_via_resample.groupby(SEAS)
> 63.4 µs per loop

gp_up_sample.quantile(FRAC_2_TAIL)
> 834 ms per loop

That is something like 2 sec vs 13 sec for the other methods.
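
Note: on current pandas releases several of the calls above (pd.datetools.relativedelta, the DatetimeIndex(start=..., end=..., freq=...) constructor, and resample(..., fill_method='ffill')) no longer exist. A rough sketch of the same upsampling approach with the newer APIs, reusing df, SEAS, month_num_to_season and FRAC_2_TAIL from above:

# Rough modern-pandas equivalent of the upsampling approach; names reused from above.
SEASON_HALO = pd.DateOffset(months=4)            # stands in for pd.datetools.relativedelta
start_with_halo = df.index.min() - SEASON_HALO
end_with_halo = df.index.max() + SEASON_HALO
seasonal_idx = pd.date_range(start_with_halo, end_with_halo, freq='QS-DEC')
seasonal_ts = pd.DataFrame(index=seasonal_idx)
seasonal_ts[SEAS] = [month_num_to_season.get(ts.month) for ts in seasonal_ts.index]
seasonal_minutely_ts = seasonal_ts.resample('min').ffill()   # fill_method='ffill' is now .ffill()
df_via_resample = df.join(seasonal_minutely_ts)
gp_up_sample = df_via_resample.groupby(SEAS)
gp_up_sample.quantile(FRAC_2_TAIL)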

Recommended answer

In case it helps, I would suggest replacing the following list comprehension and dict lookup that you identified as slow:

month_to_season_dct = {
    1: 'DJF', 2: 'DJF',
    3: 'MAM', 4: 'MAM', 5: 'MAM',
    6: 'JJA', 7: 'JJA', 8: 'JJA',
    9: 'SON', 10: 'SON', 11: 'SON',
    12: 'DJF'
}
grp_ary = [month_to_season_dct.get(t_stamp.month) for t_stamp in df.index]

with the following, which uses a numpy array as a lookup table.

month_to_season_lu = np.array([
    None,
    'DJF', 'DJF',
    'MAM', 'MAM', 'MAM',
    'JJA', 'JJA', 'JJA',
    'SON', 'SON', 'SON',
    'DJF'
])
grp_ary = month_to_season_lu[df.index.month]

Here's a timeit comparison of the two approaches on ~3 years of minutely data:

In [16]: timeit [month_to_season_dct.get(t_stamp.month) for t_stamp in df.index]
1 loops, best of 3: 12.3 s per loop

In [17]: timeit month_to_season_lu[df.index.month]
1 loops, best of 3: 549 ms per loop
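
For completeness, here is a minimal sketch of how the lookup table slots into the original workflow, reusing df, FRAC_2_TAIL and month_to_season_lu from the snippets above:

# Sketch only: group by the vectorised season labels and take both tail quantiles.
seasons = month_to_season_lu[df.index.month]      # fast month -> season lookup
grouped = df.groupby(seasons)
low_bounds = grouped.quantile(FRAC_2_TAIL)        # lower 2.5% bound per season
high_bounds = grouped.quantile(1 - FRAC_2_TAIL)   # upper 97.5% bound per season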
