Python PANDAS: Resampling Multivariate Time Series with a Groupby
Problem description
I have data in the following general format that I would like to resample to 30 day time series windows:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43
I would like the 30-day windows to start on 2014-01-01 and end on 2018-12-31. It is NOT guaranteed that every customer will have records in every window. If a customer has multiple transactions in a window, the window should take the units-weighted average of the price, sum the units, and concatenate the product names, creating one record per customer per window.
What I have so far is something like this:
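# Units-weighted average; the weights are looked up by the group's row index.
wa = lambda x: np.average(x, weights=df.loc[x.index, 'units'])
# Join the product names in a group with '/'.
con = lambda x: '/'.join(x)
# Pass the functions themselves, not their names as strings.
agg_funcs = {'customer_id': 'first',
             'product': con,
             'price': wa,
             'transaction_dt': 'first',
             'units': 'sum'}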
df_window = df.groupby(['customer_id', pd.Grouper(freq='30D')]).agg(agg_funcs)
df_window_final = df_window.unstack('customer_id', fill_value=0)
If anyone knows some better ways to approach this problem (particularly with an in-place and/or vectorized method), I would appreciate it. Ideally, I would also like to add the window start and stop dates as columns to the rows as well.
Ideally, the final output would look like this:
'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'
1,2004-01-02,thing1/thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
2,2004-01-29,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
3,2017-07-15,thing3,(weighted average price),(total units),(window_start_dt),(window_end_dt)
3,2016-05-12,thing3,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2012-02-23,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2009-10-10,thing1,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2014-04-04,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
5,2008-07-09,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
Recommended answer
Edited for a new solution. I think you can map each transaction_dt value to a 30-day Period object and then group on that period.
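import numpy as np
import pandas as pd

# Consecutive 30-day periods covering the whole date range.
p = pd.period_range('2004-1-1', '12-31-2018', freq='30D')
def find_period(v):
    # Position of the first period whose end lies after the timestamp v.
    p_idx = np.argmax(v < p.end_time)
    return p[p_idx]
# transaction_dt must already be datetime64 (e.g. via pd.to_datetime).
df['period'] = df['transaction_dt'].apply(find_period)
df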
customer_id transaction_dt product price units period
0 1 2004-01-02 thing1 25 47 2004-01-01
1 1 2004-01-17 thing2 150 8 2004-01-01
2 2 2004-01-29 thing2 150 25 2004-01-01
3 3 2017-07-15 thing3 55 17 2017-06-21
4 3 2016-05-12 thing3 55 47 2016-04-27
5 4 2012-02-23 thing2 150 22 2012-02-18
6 4 2009-10-10 thing1 25 12 2009-10-01
7 4 2014-04-04 thing2 150 2 2014-03-09
8 5 2008-07-09 thing2 150 43 2008-07-08
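Since the question asks for a vectorized approach, the same window assignment can also be sketched with plain timedelta arithmetic instead of a row-wise `apply`. This is only an illustration under stated assumptions: the origin `2004-01-01` matches the `period_range` start used above, the column names `window_start_dt` and `window_end_dt` are taken from the desired output in the question, and the end date is treated as the last calendar day of the window.

```python
import pandas as pd

# Small sample in the question's format (only the columns needed here).
df = pd.DataFrame({
    'customer_id': [1, 1, 2, 3],
    'transaction_dt': pd.to_datetime(
        ['2004-01-02', '2004-01-17', '2004-01-29', '2017-07-15']),
})

origin = pd.Timestamp('2004-01-01')  # assumed start of the first window
width = pd.Timedelta(days=30)        # window length

# Number of whole 30-day windows elapsed since the origin, for every row at once.
window_no = (df['transaction_dt'] - origin) // width

df['window_start_dt'] = origin + window_no * width
# Assumption: the window end is inclusive, i.e. the last day of the window.
df['window_end_dt'] = df['window_start_dt'] + width - pd.Timedelta(days=1)
```

For the sample rows this gives the same window starts as the `period` column shown above (2004-01-01 for the first three rows and 2017-06-21 for the last one), without looping over the rows in Python.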
We can now use this dataframe to get the concatenation of products, weighted average of price and sum of units. We then use some of the Period functionality to get the end time.
def my_funcs(df):
    # df is the sub-frame for one (customer_id, period) group.
    data = {}
    data['product'] = '/'.join(df['product'].tolist())
    data['units'] = df.units.sum()
    # Units-weighted average price within the window.
    data['price'] = np.average(df['price'], weights=df['units'])
    data['transaction_dt'] = df['transaction_dt'].iloc[0]
    # Every row in the group shares the same period, so take the first.
    data['window_start_time'] = df['period'].iloc[0].start_time
    data['window_end_time'] = df['period'].iloc[0].end_time
    return pd.Series(data, index=['transaction_dt', 'product', 'price', 'units',
                                  'window_start_time', 'window_end_time'])
df.groupby(['customer_id', 'period']).apply(my_funcs).reset_index('period', drop=True)
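If the per-group `apply` turns out to be slow on a large frame, one possible alternative is to express the weighted average as a sum of `price * units` divided by a sum of `units`, so that everything runs through built-in aggregations. The sketch below is not the answer's method; it assumes a helper column named `revenue` (hypothetical), a window origin of `2004-01-01`, and an inclusive window end date.

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'transaction_dt': pd.to_datetime(['2004-01-02', '2004-01-17', '2004-01-29']),
    'product': ['thing1', 'thing2', 'thing2'],
    'price': [25, 150, 150],
    'units': [47, 8, 25],
})

origin = pd.Timestamp('2004-01-01')  # assumed start of the first window
width = pd.Timedelta(days=30)

# Label each row with the start date of its 30-day window.
df['window_start_dt'] = origin + ((df['transaction_dt'] - origin) // width) * width

# sum(price * units) / sum(units) equals the units-weighted average price.
df['revenue'] = df['price'] * df['units']  # hypothetical helper column

out = (
    df.groupby(['customer_id', 'window_start_dt'], as_index=False)
      .agg(transaction_dt=('transaction_dt', 'first'),
           product=('product', '/'.join),
           revenue=('revenue', 'sum'),
           units=('units', 'sum'))
)
out['price'] = out['revenue'] / out['units']
out['window_end_dt'] = out['window_start_dt'] + width - pd.Timedelta(days=1)
out = out.drop(columns='revenue')
```

Only windows that contain at least one transaction appear in `out`, which matches the requirement that customers need not have a record in every window.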