Python DataFrame: selecting rows at monthly increments from daily data


Question

Let's get right into the question. The following is the daily data:

             AAA    BBB    CCC
date                           
2012-04-16  44.48  28.48  17.65
2012-04-17  44.59  28.74  17.65
2012-04-18  44.92  28.74  17.72
2012-04-19  44.92  28.62  17.72
2012-04-20  45.09  28.68  17.71
2012-04-23  45.09  28.40  17.76
2012-04-24  45.09  28.51  17.73
2012-04-25  45.01  28.76  17.73
2012-04-26  45.40  28.94  17.76
2012-04-27  45.57  29.02  17.79
2012-04-30  45.45  28.90  17.80
2012-05-01  45.79  29.07  17.80
2012-05-02  45.71  28.98  17.77
2012-05-03  45.44  28.81  17.79
2012-05-04  45.05  28.48  17.79
2012-05-07  45.05  28.48  17.79
2012-05-08  45.00  28.40  17.93
2012-05-09  44.87  28.30  17.94
2012-05-10  44.93  28.34  17.85
2012-05-11  44.86  28.30  17.96
           ...    ...    ...

I want to select rows at monthly increments starting from the first row, that is, the rows whose index is 2012-04-16, 2012-05-16, 2012-06-16, and so on. I could use relativedelta and add the dates manually, but I wonder whether there is a more efficient method. I tried resampling, but that only lets me pick the first or last row of each month, as in df.resample('M').first().

What complicates the problem is that some dates are missing: the data follows business days, but not the U.S. business calendar. There are several ways to handle this:

  1. Choose the exact date or the closest earlier date. If no such date exists, start looking at later dates.

  2. Choose the exact date or the closest later date. If no such date exists, start looking at earlier dates.

  3. Choose the date closest to the exact date, regardless of whether it is earlier or later; I can use min(df.index, key=lambda x: abs(x - (df.index[0] + relativedelta(months=1)))).

In each of these cases, I'd like to know which method is the most efficient and readable. In the last code example the number of months is a variable, so I'm not sure whether I can turn it into a lambda and use apply.

Thank you.

Recommended answer

Before we look at your data, let's first see how to create a DatetimeIndex for a specific day of each month. Since pd.date_range with monthly frequency returns the last day of each month, we can simply add a fixed number of days:

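import pandas as pd

# Month end + 16 days is the 16th of the following month.
# Note: in pandas 2.2 and later the month-end alias is 'ME'; 'M' is deprecated.
idx = pd.date_range('2018-04-01', '2018-07-01', freq='1M') + pd.DateOffset(days=16)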

DatetimeIndex(['2018-05-16', '2018-06-16', '2018-07-16'],
              dtype='datetime64[ns]', freq=None)
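
The fixed 16-day shift works because a month end plus 16 days always lands on the 16th of the next month. For a start day late in the month (29th to 31st) no fixed shift would do, so here is a minimal sketch of a more general alternative, assuming you want the targets anchored to an arbitrary first date. The helper name `monthly_targets` is hypothetical and not part of pandas; it only relies on `pd.DateOffset(months=k)`, which clips to the month end when the day does not exist.

```python
import pandas as pd


def monthly_targets(start, end):
    """Return start, start + 1 month, start + 2 months, ... not exceeding end.

    Hypothetical helper for illustration only.
    """
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    targets = [start]
    k = 1
    while True:
        # Offset from the original start (not from the previous target) so a
        # clipped month end such as Jan 31 -> Feb 29 does not cause drift.
        candidate = start + pd.DateOffset(months=k)
        if candidate > end:
            break
        targets.append(candidate)
        k += 1
    return pd.DatetimeIndex(targets)


idx = monthly_targets("2012-04-16", "2012-07-17")
print(idx)
```

The loop is deliberately simple; for the data in the question it yields the same four target dates as the month-end-plus-offset approach.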

Now let's take an example dataframe in which some of the 16th days are missing:

              AAA    BBB    CCC
date                           
2012-04-16  44.48  28.48  17.65
2012-04-17  44.59  28.74  17.65
2012-05-15  45.79  29.07  17.80
2012-05-16  45.71  28.98  17.77
2012-05-17  45.44  28.81  17.79
2012-06-15  44.87  28.30  17.94
2012-06-17  44.95  28.50  17.98
2012-07-14  44.65  28.25  17.87
2012-07-17  44.55  28.75  17.75

As you mention, there are several ways to decide how to select non-matching days: go backwards, go forwards, or take the nearest date with no preference. You need to consider which is most appropriate in the context of your project. Below is a solution that sticks to Pandas functionality and avoids custom lambda functions.

First, create a dataframe that contains only the required index values:

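# Month end + 16 days gives the 16th of the following month.
offset = pd.DateOffset(days=16)
# Start one month early so the first target (2012-04-16) is included.
start_date = df.index[0] - pd.DateOffset(months=1)
idx = pd.date_range(start_date, df.index[-1], freq='1M') + offset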

df_idx = pd.DataFrame(index=idx)

Notice that we need to subtract a month from the start argument, so that the first month is not omitted after adding 16 days. Now you can use pd.merge_asof with a variety of options.

Specify the direction argument as 'backward' (the default), 'forward' or 'nearest', as appropriate. For example, using 'forward':

print(pd.merge_asof(df_idx, df, left_index=True, right_index=True, direction='forward'))

              AAA    BBB    CCC
2012-04-16  44.48  28.48  17.65
2012-05-16  45.71  28.98  17.77
2012-06-16  44.95  28.50  17.98
2012-07-16  44.55  28.75  17.75

This may already be sufficient for your needs.
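
To see how the three direction values differ on the example data, the comparison can be sketched as follows. This is an illustration under a few assumptions rather than part of the original answer: only the AAA column is reproduced, the column names `target` and `date` are chosen here for readability, and `pd.offsets.MonthEnd()` is passed instead of the `'1M'` alias so the snippet does not depend on which alias a given pandas version accepts.

```python
import pandas as pd

dates = pd.to_datetime([
    "2012-04-16", "2012-04-17", "2012-05-15", "2012-05-16", "2012-05-17",
    "2012-06-15", "2012-06-17", "2012-07-14", "2012-07-17",
])
df = pd.DataFrame(
    {"AAA": [44.48, 44.59, 45.79, 45.71, 45.44, 44.87, 44.95, 44.65, 44.55]},
    index=dates.rename("date"),
)

# Target dates: the 16th of each month, built as month end + 16 days.
start_date = df.index[0] - pd.DateOffset(months=1)
idx = pd.date_range(start_date, df.index[-1], freq=pd.offsets.MonthEnd())
idx = idx + pd.DateOffset(days=16)
# Keep both merge keys in the same datetime dtype.
targets = pd.DataFrame({"target": idx.astype(df.index.dtype)})

results = {}
for direction in ("backward", "forward", "nearest"):
    # The 'date' column of each result shows which actual row was matched.
    results[direction] = pd.merge_asof(
        targets, df.reset_index(),
        left_on="target", right_on="date", direction=direction,
    )
    print(direction)
    print(results[direction])
```

With 'backward', the missing 2012-06-16 falls back to 2012-06-15; with 'forward', it moves on to 2012-06-17. Note that a single merge_asof call looks in one direction only, so the "otherwise look the other way" fallback from the question would need a second pass over the unmatched targets.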

Edit: If you want to keep the dates from the original dataframe, you can reverse the direction of the merge, making df the left frame, and use 'backward' instead of 'forward':

res = pd.merge_asof(df.reset_index(),
                    df_idx.reset_index().rename(columns={'index': 'date_idx'}),
                    left_on='date', right_on='date_idx', direction='backward')

res['diff'] = (res['date'] - res['date_idx']).dt.days.abs()
grouper = res['date'].dt.strftime('%Y-%m')
res = res[res['diff'] == res.groupby(grouper)['diff'].transform('min')]

print(res)

        date    AAA    BBB    CCC   date_idx  diff
0 2012-04-16  44.48  28.48  17.65 2012-04-16     0
3 2012-05-16  45.71  28.98  17.77 2012-05-16     0
6 2012-06-17  44.95  28.50  17.98 2012-06-16     1
8 2012-07-17  44.55  28.75  17.75 2012-07-16     1
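
If the goal is only to pull the matching rows out of df while keeping its own dates as the index, a shorter route may be `Index.get_indexer`, whose `method` argument accepts 'pad' (earlier), 'backfill' (later) and 'nearest'. The sketch below is an alternative to the merge shown above, not the original answer's method; it assumes a sorted index with no duplicate dates, reproduces only the AAA column, and uses a hypothetical helper named `pick_rows`.

```python
import pandas as pd

dates = pd.to_datetime([
    "2012-04-16", "2012-04-17", "2012-05-15", "2012-05-16", "2012-05-17",
    "2012-06-15", "2012-06-17", "2012-07-14", "2012-07-17",
])
df = pd.DataFrame(
    {"AAA": [44.48, 44.59, 45.79, 45.71, 45.44, 44.87, 44.95, 44.65, 44.55]},
    index=dates.rename("date"),
)
targets = pd.to_datetime(["2012-04-16", "2012-05-16", "2012-06-16", "2012-07-16"])


def pick_rows(frame, wanted, method):
    """Select the rows of `frame` matching `wanted` dates (hypothetical helper).

    method: 'pad' (exact or earlier), 'backfill' (exact or later), 'nearest'.
    """
    pos = frame.index.get_indexer(wanted, method=method)
    # get_indexer returns -1 where no row qualifies, e.g. 'pad' for a target
    # earlier than the first date; those targets are dropped here.
    return frame.iloc[pos[pos >= 0]]


earlier = pick_rows(df, targets, "pad")
later = pick_rows(df, targets, "backfill")
closest = pick_rows(df, targets, "nearest")
print(earlier, later, closest, sep="\n\n")
```

Because the result is sliced from df itself, the index holds the actual trading dates rather than the idealised 16th of each month.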
