1-year rolling mean in pandas on a date column
Question
I would like to compute the 1-year rolling average for each row of the DataFrame below.
test:
index id date variation
2313 7034 2018-03-14 4.139148e-06
2314 7034 2018-03-13 4.953194e-07
2315 7034 2018-03-12 2.854749e-06
2316 7034 2018-03-09 3.907458e-06
2317 7034 2018-03-08 1.662412e-06
2318 7034 2018-03-07 1.346433e-06
2319 7034 2018-03-06 8.731700e-06
2320 7034 2018-03-05 7.145597e-06
2321 7034 2018-03-02 4.893283e-06
...
For example, I need to compute:
- the mean of variation for id 7034 between 2018-03-14 and 2017-08-14
- the mean of variation for id 7034 between 2018-03-13 and 2017-08-13
- etc.
I tried:
test.groupby(['id','date'])['variation'].rolling(window=1,freq='Y',on='date').mean()
But I get the error message:
ValueError: invalid on specified as date, must be a column (if DataFrame) or None
How can I use the pandas rolling() function in this case?
Thanks, Sacul.
I tested:
df['date'] = pd.to_datetime(df['date'])
df.set_index('date').groupby('id').rolling(window=1, freq='Y').mean()['variation']
But freq='Y' doesn't work (I got: ValueError: Invalid frequency: Y), so I then used window = 365, freq = 'D'.
But there is another issue: because there are never 365 consecutive dates for any id, the result is always empty. Even if dates are missing, I would like to ignore the gaps and use all dates between the current date and the current date minus 365 days to compute the rolling mean. For instance, imagine I have:
index id date variation
2313 7034 2018-03-14 4.139148e-06
2314 7034 2018-03-13 4.953194e-07
2315 7034 2017-03-13 2.854749e-06
Then:
- for 7034 2018-03-14: I would like to compute MEAN(4.139148e-06, 4.953194e-07, 2.854749e-06)
- for 7034 2018-03-13: I would also like to compute MEAN(4.139148e-06, 4.953194e-07, 2.854749e-06)
How can I do that?
Finally, I used the formulas below to calculate the rolling median, mean, and standard deviation over 1 year while ignoring missing values:
pd.rolling_median(df.set_index('date').groupby('id')['variation'],window=365, freq='D',min_periods=1)
pd.rolling_mean(df.set_index('date').groupby('id')['variation'],window=365, freq='D',min_periods=1)
pd.rolling_std(df.set_index('date').groupby('id')['variation'],window=365, freq='D',min_periods=1)
Thanks a lot for your help!
Thomas
Recommended answer
I believe this should work for you:
# First make sure that `date` is a datetime object:
df['date'] = pd.to_datetime(df['date'])
df.set_index('date').groupby('id').rolling(window=1, freq='A').mean()['variation']
Using pd.DataFrame.rolling with datetimes works well when the date is the index, which is why I used df.set_index('date') (as can be seen in one of the documentation's examples).
I can't really test whether it gives the yearly average on your example dataframe, as it contains only one year and only one ID, but it should work.
As pointed out by Mihai-Andrei Dinculescu, freq is now a deprecated argument. Here is an alternative (and probably more future-proof) way to do what you're looking for:
df.set_index('date').groupby('id')['variation'].resample('A').mean()
You can check the linked page for more information on the frequency arguments.
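Note that resample produces one mean per calendar year rather than a trailing window for each row. For a per-row trailing window that tolerates missing dates, one option is a time-offset window such as '365D' on a datetime index. The following is only a sketch under assumptions: the sample values are made up, the names roller and rolling_stats are illustrative, and the window covers the interval (date minus 365 days, date], so it looks backward only and will not include later rows as in the second case of the question's example.

```python
import pandas as pd

# Made-up sample data in the same shape as the question's DataFrame.
df = pd.DataFrame({
    "id": [7034, 7034, 7034, 7034, 9999, 9999],
    "date": ["2018-03-14", "2018-03-13", "2017-03-20", "2017-03-01",
             "2018-02-01", "2018-01-01"],
    "variation": [4.0, 2.0, 6.0, 100.0, 20.0, 10.0],
})
df["date"] = pd.to_datetime(df["date"])

# Offset-based windows need the dates to be ordered within each group.
df = df.sort_values(["id", "date"])

# Trailing 365-day window per id; min_periods=1 keeps rows with sparse history.
roller = (
    df.set_index("date")
      .groupby("id")["variation"]
      .rolling("365D", min_periods=1)
)

# One column per statistic, indexed by (id, date).
rolling_stats = pd.DataFrame({
    "mean": roller.mean(),
    "median": roller.median(),
    "std": roller.std(),
})
print(rolling_stats)
```

Because the window is defined by elapsed time rather than by a row count, gaps in the dates do not empty the result; each row simply averages whatever observations fall within the preceding 365 days.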