Pandas DataFrame MultiIndex groupby rolling operation with missing dates
Question
I have a DataFrame with a MultiIndex whose last level is a date. I am trying to apply a rolling operation to the columns over a specific time frequency. As I understand it, if I had a plain DatetimeIndex, the usual pandas approach would be to call the rolling function with a frequency string (for example '2D' for a two-day window). Another suggested approach is to resample the DatetimeIndex and then apply the rolling function with the integer 2. Essentially, I want to group by all index levels except the last one and then tell the rolling operation to use the last level for timedelta-based windows. Below is an example to demonstrate this:
from datetime import datetime
import pandas as pd

# Index levels: a group key and a date; group "B" has no entry for 2017-01-02.
multi_index = pd.MultiIndex.from_tuples([
    ("A", datetime(2017, 1, 1)),
    ("A", datetime(2017, 1, 2)),
    ("A", datetime(2017, 1, 3)),
    ("A", datetime(2017, 1, 4)),
    ("B", datetime(2017, 1, 1)),
    ("B", datetime(2017, 1, 3)),
    ("B", datetime(2017, 1, 4))])
df = pd.DataFrame(index=multi_index, data={"colA": [1, 1, 1, 1, 1, 1, 1]})
display(df)  # display() comes from IPython/Jupyter; use print(df) in a plain script
# Group by the key and by day, then take a rolling sum over 2 rows.
df.groupby([df.index.get_level_values(0), pd.Grouper(freq="1D", level=-1)]).sum().rolling(2).sum()
The above code does not create a row for (B, datetime(2017, 1, 2)), so every non-NaN rolling sum comes out as 2.
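One way to make the gap visible is to compare each group's dates against a complete daily range. The following is only a minimal sketch on the same data; the level names key and date are assumptions added here for readability and do not appear in the original code.

```python
from datetime import datetime
import pandas as pd

multi_index = pd.MultiIndex.from_tuples(
    [
        ("A", datetime(2017, 1, 1)),
        ("A", datetime(2017, 1, 2)),
        ("A", datetime(2017, 1, 3)),
        ("A", datetime(2017, 1, 4)),
        ("B", datetime(2017, 1, 1)),
        ("B", datetime(2017, 1, 3)),
        ("B", datetime(2017, 1, 4)),
    ],
    names=["key", "date"],  # assumed level names, not in the original
)
df = pd.DataFrame({"colA": [1, 1, 1, 1, 1, 1, 1]}, index=multi_index)

# For every group, list the days between its first and last date
# that have no row in the frame.
missing = {}
for key, group in df.groupby(level="key"):
    dates = group.index.get_level_values("date")
    full_range = pd.date_range(dates.min(), dates.max(), freq="D")
    missing[key] = full_range.difference(dates)

print(missing["A"])  # empty: group A has every day
print(missing["B"])  # contains 2017-01-02 only
```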
One ugly way to get around this, which really only works if at least one group has all the days, is to unstack, fillna and stack before rolling:
df.groupby([df.index.get_level_values(0), pd.Grouper(freq="1D", level=-1)]
).sum().unstack().fillna(0).stack().rolling(2).sum()
Needless to say, this is an ugly hack, slow and error-prone. Is there a nice way to achieve what I need here without extensive manipulation? Ideally, is there some way to tell the grouper to use the timestamp column, or to fill in the missing values itself?
Recommended answer
You can use resample + fillna (requires pandas 0.19.0 or later):
multi_index = pd.MultiIndex.from_tuples([
("A", datetime(2017, 1, 1)),
("A", datetime(2017, 1, 2)),
("A", datetime(2017, 1, 3)),
("A", datetime(2017, 1, 4)),
("B", datetime(2017, 1, 1)),
("B", datetime(2017, 1, 3)),
("B", datetime(2017, 1, 4))])
df = pd.DataFrame(index=multi_index, data={"colA": [1, 2, 3, 4, 1, 2, 3]})
print (df)
              colA
A 2017-01-01     1
  2017-01-02     2
  2017-01-03     3
  2017-01-04     4
B 2017-01-01     1
  2017-01-03     2
  2017-01-04     3
b = df.groupby(level=0).resample('1D', level=1).sum().fillna(0).rolling(2).sum()
print (b)
              colA
A 2017-01-01   NaN
  2017-01-02   3.0
  2017-01-03   5.0
  2017-01-04   7.0
B 2017-01-01   5.0
  2017-01-02   1.0
  2017-01-03   2.0
  2017-01-04   5.0
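Note that in the output above the value for (B, 2017-01-01) is 5.0, which is A's last value (4) plus B's first value (1): the final rolling(2) runs over the whole frame, not within each group. If the window should restart for every group, one option is to roll each group separately. The sketch below is an alternative under stated assumptions, not the answer's own method: the level names key and date and the variable names are made up for illustration. It shows two variants, one that fills missing days with 0 via asfreq and uses a 2-row window, and one that leaves the gaps alone and uses the time-based window "2D".

```python
from datetime import datetime
import pandas as pd

multi_index = pd.MultiIndex.from_tuples(
    [
        ("A", datetime(2017, 1, 1)),
        ("A", datetime(2017, 1, 2)),
        ("A", datetime(2017, 1, 3)),
        ("A", datetime(2017, 1, 4)),
        ("B", datetime(2017, 1, 1)),
        ("B", datetime(2017, 1, 3)),
        ("B", datetime(2017, 1, 4)),
    ],
    names=["key", "date"],  # assumed level names, not in the original
)
df = pd.DataFrame({"colA": [1, 2, 3, 4, 1, 2, 3]}, index=multi_index)

filled = {}
offset = {}
for key, group in df.groupby(level="key"):
    # Drop the group level so the Series is indexed by date only.
    s = group["colA"].droplevel("key")
    # Variant 1: insert missing days as 0, then sum over a window of 2 rows.
    filled[key] = s.asfreq("D", fill_value=0).rolling(2).sum()
    # Variant 2: keep the gaps and sum over a 2-day time window instead.
    offset[key] = s.rolling("2D").sum()

# Reassemble the per-group results under a (key, date) MultiIndex.
per_group_filled = pd.concat(filled, names=["key"])
per_group_offset = pd.concat(offset, names=["key"])

print(per_group_filled)  # B starts with NaN, then 1.0, 2.0, 5.0
print(per_group_offset)  # B: 1.0, 2.0, 5.0 (no row for 2017-01-02)
```

The two variants differ in shape: the filled version gains a row for (B, 2017-01-02) and starts each group with NaN, while the offset version keeps the original seven rows because time-based windows default to min_periods=1.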