pandas resample to specific weekday in month


Question

I have a pandas DataFrame that I'd like to resample to the third Friday of every month.

import numpy as np
import pandas as pd

np.random.seed(0)
dates = pd.date_range("2018-01-01", "2018-08-31")
dates_df = pd.DataFrame(data=np.random.random(len(dates)), index=dates)
# requested output: keep only Fridays that fall on days 15-21 (the third Friday)
mask = (dates.weekday == 4) & (14 < dates.day) & (dates.day < 22)
dates_df.loc[mask]

But when a third Friday is missing (for example, after dropping February's third Friday, 2018-02-16), I want the latest available value (the one as of 2018-02-15). Using the mask gives me the next value instead (Feb 17 rather than Feb 15):

# remove February third Friday:
dates_df = dates_df.drop([pd.to_datetime("2018-02-16")])
mask = (dates.weekday == 4) & (14 < dates.day) & (dates.day < 22)
dates_df.loc[mask]

Using a monthly resample in combination with loffset gives the end-of-month values with the index shifted, which is also not what I want:

from pandas.tseries.offsets import WeekOfMonth
dates_df.resample("M", loffset=WeekOfMonth(week=2, weekday=4)).last()

Is there an alternative (preferably using resample) that does not require resampling to daily values first and then applying a mask? That takes a long time on my DataFrame.

Recommended answer

If I understand correctly, your second attempt is on the right track. You just need to resample using WeekOfMonth as the rule rather than as an offset:

dates_df.resample(WeekOfMonth(week=2, weekday=4)).asfreq().dropna()

This approach does not shift the index; it returns only the rows for the third Friday of each month.
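To make the WeekOfMonth parameters concrete, here is a minimal sketch on a small deterministic frame. The column name "value" and the date range are assumptions chosen for illustration, not part of the original question.

```python
import pandas as pd
from pandas.tseries.offsets import WeekOfMonth

# Daily frame whose value is simply the day of the month (illustrative data).
idx = pd.date_range("2018-01-01", "2018-04-30", freq="D")
df = pd.DataFrame({"value": idx.day.to_numpy()}, index=idx)

# week=2 is the third week (weeks are 0-based), weekday=4 is Friday (Monday=0).
third_fridays = df.resample(WeekOfMonth(week=2, weekday=4)).asfreq().dropna()

# The first bin starts before the data begins, so its row is NaN and gets dropped.
print(third_fridays)
```

Every remaining label should be a Friday between the 15th and the 21st, and the values are the original ones, not aggregates.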

Dealing with a missing third Friday:

With the code above, a month whose third Friday is missing is excluded entirely. Depending on how you want to handle missing data, you can fill the gap with ffill (also called pad) or bfill. For example, you can amend the code above as follows:

dates_df.resample(rule=WeekOfMonth(week=2,weekday=4)).bfill().asfreq(freq='D').dropna()

The above back-fills a missing third Friday with the next available value.

Update: let's work with a fixed data set instead of np.random:

# create a smaller daterange
dates = pd.date_range("2018-05-01", "2018-08-31")

# create a data with only 1,2,3 values
data = [1,2,3] * int(len(dates)/3)

dates_df = pd.DataFrame(data=data, index=dates)
dates_df.head()

# Output:

2018-05-01  1
2018-05-02  2
2018-05-03  3
2018-05-04  1
2018-05-05  2

Now let's check what the data looks like on the third Friday of each month by selecting those dates manually:

dates_df.loc[[
    pd.Timestamp('2018-05-18'),
    pd.Timestamp('2018-06-15'),
    pd.Timestamp('2018-07-20'),
    pd.Timestamp('2018-08-17')
]]

# Output:

2018-05-18  3
2018-06-15  1
2018-07-20  3
2018-08-17  1

If no third Fridays are missing, running the code provided earlier:

dates_df.resample(rule=WeekOfMonth(week=2,weekday=4)).asfreq().dropna()

produces the following output:

2018-05-18  3
2018-06-15  1
2018-07-20  3
2018-08-17  1

As you can see, the index has not been shifted, and the result contains the exact values for the third Friday of each month.

Now suppose some third Fridays are missing. How you fill them depends on whether you want the previous value (ffill) or the next value (bfill):

  • pad/ffill: propagate the last valid observation forward to the next valid one
  • backfill/bfill: use the next valid observation to fill the gap

dates_df.drop(index=pd.Timestamp('2018-08-17')).resample(rule=WeekOfMonth(week=2, weekday=4)).ffill().asfreq(freq='D').dropna()

2018-05-18  3
2018-06-15  1
2018-07-20  3
2018-08-17  3

dates_df.drop(index=pd.Timestamp('2018-08-17')).resample(rule=WeekOfMonth(week=2, weekday=4)).bfill().asfreq(freq='D').dropna()

2018-04-20  1
2018-05-18  3
2018-06-15  1
2018-07-20  3
2018-08-17  2
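As an alternative to filling inside resample, the "latest value as of the third Friday" can also be sketched by building the target dates explicitly and forward-filling onto them with reindex. This is a different technique from the resample calls above, offered as an assumption about what fits the question. The names gappy, targets and latest are made up for this example, and the frame must have a sorted index for method="ffill" to work.

```python
import pandas as pd

# Same fixed data set as in the update above.
dates = pd.date_range("2018-05-01", "2018-08-31")
data = [1, 2, 3] * (len(dates) // 3)
dates_df = pd.DataFrame(data=data, index=dates)

# Simulate a missing third Friday.
gappy = dates_df.drop(index=pd.Timestamp("2018-08-17"))

# "WOM-3FRI" is the string alias for WeekOfMonth(week=2, weekday=4).
targets = pd.date_range(gappy.index.min(), gappy.index.max(), freq="WOM-3FRI")

# For each target date, take the last available row at or before it.
latest = gappy.reindex(targets, method="ffill")
print(latest)
```

Here 2018-08-17 picks up the value from 2018-08-16, which matches the ffill result shown above, and only the third-Friday rows are ever materialized.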

If instead the whole index is shifted, as in your example:

dates_df.resample(rule='M', loffset=WeekOfMonth(week=2, weekday=4)).asfreq().dropna()

# Output:

2018-06-15  1
2018-07-20  1
2018-08-17  2
2018-09-21  3

What happens there is that you resample by rule 'M' (month end) and then shift each index label forward to the next third Friday.

Before the offset is applied, the result looks like this:

dates_df.resample(rule='M').asfreq().dropna()

# Output

2018-05-31  1
2018-06-30  1
2018-07-31  2
2018-08-31  3
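Note that, as far as I know, the loffset argument was deprecated in pandas 1.1 and removed in 2.0, so the loffset call above only runs on older versions. A rough sketch of the same two steps on a newer pandas is to resample first and then shift the labels yourself. The variable names here are made up for illustration, and MonthEnd() is passed as an object to avoid depending on the 'M' string alias.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd, WeekOfMonth

# Same fixed data set as in the update above.
dates = pd.date_range("2018-05-01", "2018-08-31")
data = [1, 2, 3] * (len(dates) // 3)
dates_df = pd.DataFrame(data=data, index=dates)

# Step 1: month-end resample, taking the last value in each month.
month_end = dates_df.resample(MonthEnd()).last()

# Step 2: shift each label forward to the next third Friday,
# which is what loffset used to do to the index.
offset = WeekOfMonth(week=2, weekday=4)
shifted = month_end.copy()
shifted.index = pd.DatetimeIndex([ts + offset for ts in month_end.index])
print(shifted)
```

The values are still the month-end values. Only the labels move, which is why this approach does not answer the original question.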
