Pandas Resample Monthly data to Weekly within Groups and Split Values

Question

I have a dataframe, shown below:

ID Date     Volume Sales
1  2020-02   10     4
1  2020-03   8      6
2  2020-02   6      8
2  2020-03   4      10

Is there an easy way to convert this to weekly data using resampling, and to divide the Volume and Sales columns by the number of weeks in the month?

I have started with the following code:

import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
grouped = df.groupby('ID').resample('W').ffill().reset_index()
print(grouped)

After this step, I get an error message: cannot insert ID, already exists

Also, is there code I can use to find the number of weeks in a month, so that I can divide the Volume and Sales columns by it?

The expected output is:

    ID     Volume Sales   Weeks
0   1      2.5    1.0     2020-02-02
0   1      2.5    1.0     2020-02-09
0   1      2.5    1.0     2020-02-16
0   1      2.5    1.0     2020-02-23
1   1      1.6    1.2     2020-03-01
1   1      1.6    1.2     2020-03-08
1   1      1.6    1.2     2020-03-15
1   1      1.6    1.2     2020-03-22
1   1      1.6    1.2     2020-03-29
2   2      1.5    2       2020-02-02
2   2      1.5    2       2020-02-09
2   2      1.5    2       2020-02-16
2   2      1.5    2       2020-02-23
3   2      0.8    2       2020-03-01
3   2      0.8    2       2020-03-08
3   2      0.8    2       2020-03-15
3   2      0.8    2       2020-03-22
3   2      0.8    2       2020-03-29

Recommended answer

After review, a much simpler solution can be used. Please refer to the subsection labeled New Solution in Part 1 below.

This task requires multiple steps. Let's break it down as follows:

Part 1: Resample the monthly data to weekly frequency within each ID

New Solution

Because the required weekly frequency is Sunday-based (i.e. freq='W-SUN') and each month is independent of its adjacent months, we can use the year-month values in column Date to generate weekly date ranges directly in one step. There is no need to first generate daily date ranges from the year-month values and then resample them to weekly.

The new logic only needs pd.date_range() with freq='W', together with pd.offsets.MonthEnd(), to generate the weekly dates for a month. It does not need to call .resample() or .asfreq() as other solutions do. Effectively, pd.date_range() with freq='W' does the resampling for us.
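
To make the idea concrete for a single month, here is a minimal sketch. The helper name weekly_sundays is hypothetical and not part of the original answer; it only wraps the pd.date_range() and pd.offsets.MonthEnd() calls described above and assumes the input is a 'YYYY-MM' string.

```python
import pandas as pd


def weekly_sundays(year_month: str) -> pd.DatetimeIndex:
    """Return every Sunday that falls inside the given 'YYYY-MM' month.

    Hypothetical helper for illustration only.
    """
    start = pd.to_datetime(year_month)        # parsed as the 1st day of the month
    end = start + pd.offsets.MonthEnd()       # last day of the same month
    # freq='W' is the Sunday-anchored weekly frequency (same as 'W-SUN')
    return pd.date_range(start=start, end=end, freq='W')


feb = weekly_sundays('2020-02')
mar = weekly_sundays('2020-03')

print(feb.strftime('%Y-%m-%d').tolist())
# ['2020-02-02', '2020-02-09', '2020-02-16', '2020-02-23']
print(mar.strftime('%Y-%m-%d').tolist())
# ['2020-03-01', '2020-03-08', '2020-03-15', '2020-03-22', '2020-03-29']
```

Each month is handled on its own, so no date from an adjacent month (such as 2020-04-05) can appear in the output.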

The code is as follows:

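import pandas as pd

# For each 'YYYY-MM' value, list the Sundays from the 1st to the last day of that month
df['Weeks'] = df['Date'].map(lambda x:
                             pd.date_range(
                                 start=pd.to_datetime(x),
                                 end=(pd.to_datetime(x) + pd.offsets.MonthEnd()),
                                 freq='W'))

# Expand each list of weekly dates into one row per week
df = df.explode('Weeks')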


Result:

print(df)


   ID     Date  Volume  Sales      Weeks
0   1  2020-02      10      4 2020-02-02
0   1  2020-02      10      4 2020-02-09
0   1  2020-02      10      4 2020-02-16
0   1  2020-02      10      4 2020-02-23
1   1  2020-03       8      6 2020-03-01
1   1  2020-03       8      6 2020-03-08
1   1  2020-03       8      6 2020-03-15
1   1  2020-03       8      6 2020-03-22
1   1  2020-03       8      6 2020-03-29
2   2  2020-02       6      8 2020-02-02
2   2  2020-02       6      8 2020-02-09
2   2  2020-02       6      8 2020-02-16
2   2  2020-02       6      8 2020-02-23
3   2  2020-03       4     10 2020-03-01
3   2  2020-03       4     10 2020-03-08
3   2  2020-03       4     10 2020-03-15
3   2  2020-03       4     10 2020-03-22
3   2  2020-03       4     10 2020-03-29

With the 2 statements above, we already get the result required for Part 1. We don't need the more complicated .groupby() and .resample() code of the old solution.

We can now continue to Part 2. As we have not created the grouped object, we can either replace grouped with df in the Part 2 code or add a new line grouped = df before continuing.
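
For readers who want to run Part 1 from scratch, the following sketch rebuilds the sample dataframe from the question and applies the new solution. The dataframe construction and the final pd.to_datetime() conversion are assumptions added here for convenience; they are not part of the original answer.

```python
import pandas as pd

# Sample data from the question; 'Date' is kept as a 'YYYY-MM' string
df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Date': ['2020-02', '2020-03', '2020-02', '2020-03'],
    'Volume': [10, 8, 6, 4],
    'Sales': [4, 6, 8, 10],
})

# One DatetimeIndex of Sundays per monthly row
df['Weeks'] = df['Date'].map(lambda x:
                             pd.date_range(
                                 start=pd.to_datetime(x),
                                 end=(pd.to_datetime(x) + pd.offsets.MonthEnd()),
                                 freq='W'))

# One row per week; the other columns are repeated for each week
df = df.explode('Weeks')

# Optional (an assumption, not in the original answer): explode() leaves
# an object column, so convert it back to a datetime column
df['Weeks'] = pd.to_datetime(df['Weeks'])

# Name expected by the Part 2 code
grouped = df

# Number of weekly rows per (ID, Date) group
print(grouped.groupby(['ID', 'Date']).size())
```

The group sizes printed at the end are the week counts that Part 2 divides by.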

Old Solution

We use pd.date_range() with freq='D', together with pd.offsets.MonthEnd(), to produce daily entries for the full month. We then set these full-month dates as the index before resampling to weekly frequency. We resample with closed='left' to exclude the unwanted week of 2020-04-05 that is produced under the default resample() parameters.

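# For each 'YYYY-MM' value, list every day of that month
df['Weeks'] = df['Date'].map(lambda x:
                             pd.date_range(
                                 start=pd.to_datetime(x),
                                 end=(pd.to_datetime(x) + pd.offsets.MonthEnd()),
                                 freq='D'))

# One row per day, indexed by date so that it can be resampled
df = df.explode('Weeks').set_index('Weeks')

# Resample each (ID, Date) group to weekly frequency
grouped = (df.groupby(['ID', 'Date'], as_index=False)
             .resample('W', closed='left')
             .ffill().dropna().reset_index(-1))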


Result:

print(grouped)


       Weeks   ID     Date  Volume  Sales
0 2020-02-02  1.0  2020-02    10.0    4.0
0 2020-02-09  1.0  2020-02    10.0    4.0
0 2020-02-16  1.0  2020-02    10.0    4.0
0 2020-02-23  1.0  2020-02    10.0    4.0
1 2020-03-01  1.0  2020-03     8.0    6.0
1 2020-03-08  1.0  2020-03     8.0    6.0
1 2020-03-15  1.0  2020-03     8.0    6.0
1 2020-03-22  1.0  2020-03     8.0    6.0
1 2020-03-29  1.0  2020-03     8.0    6.0
2 2020-02-02  2.0  2020-02     6.0    8.0
2 2020-02-09  2.0  2020-02     6.0    8.0
2 2020-02-16  2.0  2020-02     6.0    8.0
2 2020-02-23  2.0  2020-02     6.0    8.0
3 2020-03-01  2.0  2020-03     4.0   10.0
3 2020-03-08  2.0  2020-03     4.0   10.0
3 2020-03-15  2.0  2020-03     4.0   10.0
3 2020-03-22  2.0  2020-03     4.0   10.0
3 2020-03-29  2.0  2020-03     4.0   10.0

Here, we retain the column Date for use later.

Part 2: Divide Volume and Sales by the number of weeks in the month

Here, the number of weeks in the month used to divide the Volume and Sales figures should be the number of resampled weeks within the month, as shown in the interim result above.

If we used the actual number of weeks instead, the results would be inconsistent. For example, Feb 2020 has 29 days because of the leap year, so it spans 5 calendar weeks rather than the 4 resampled weeks in the interim result above. We would then have only 4 weekly entries while dividing each Volume and Sales figure by 5.
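
The difference between the two counts can be checked with a short sketch. Both function names below are hypothetical, and "calendar weeks spanned" is taken here to mean Sunday-to-Saturday week rows that contain at least one day of the month, which is one reasonable reading of the paragraph above.

```python
import calendar

import pandas as pd


def sundays_in_month(year: int, month: int) -> int:
    """Number of Sunday-anchored weekly dates inside the month (the resampled weeks)."""
    start = pd.Timestamp(year=year, month=month, day=1)
    end = start + pd.offsets.MonthEnd()
    return len(pd.date_range(start=start, end=end, freq='W'))


def calendar_weeks_spanned(year: int, month: int) -> int:
    """Number of Sunday-to-Saturday week rows that contain a day of the month."""
    cal = calendar.Calendar(firstweekday=calendar.SUNDAY)
    return len(cal.monthdayscalendar(year, month))


for year, month in [(2020, 2), (2020, 3)]:
    print(year, month,
          sundays_in_month(year, month),
          calendar_weeks_spanned(year, month))
# 2020 2 4 5
# 2020 3 5 5
```

For Feb 2020 the two counts differ (4 versus 5), which is why dividing by the size of each resampled group keeps the weekly values consistent with the rows that actually exist.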

Now let's go to the code:

We group by columns ID and Date and then divide each value in columns Volume and Sales by the group size (i.e. the number of resampled weeks).

grouped[['Volume', 'Sales']] = (grouped.groupby(['ID', 'Date'])[['Volume', 'Sales']]
                                       .transform(lambda x: x / x.count()))

or, in simplified form using /=, as follows:

grouped[['Volume', 'Sales']] /= (grouped.groupby(['ID', 'Date'])[['Volume', 'Sales']]
                                        .transform('count'))


Result:

print(grouped)


       Weeks   ID     Date  Volume  Sales
0 2020-02-02  1.0  2020-02     2.5    1.0
0 2020-02-09  1.0  2020-02     2.5    1.0
0 2020-02-16  1.0  2020-02     2.5    1.0
0 2020-02-23  1.0  2020-02     2.5    1.0
1 2020-03-01  1.0  2020-03     1.6    1.2
1 2020-03-08  1.0  2020-03     1.6    1.2
1 2020-03-15  1.0  2020-03     1.6    1.2
1 2020-03-22  1.0  2020-03     1.6    1.2
1 2020-03-29  1.0  2020-03     1.6    1.2
2 2020-02-02  2.0  2020-02     1.5    2.0
2 2020-02-09  2.0  2020-02     1.5    2.0
2 2020-02-16  2.0  2020-02     1.5    2.0
2 2020-02-23  2.0  2020-02     1.5    2.0
3 2020-03-01  2.0  2020-03     0.8    2.0
3 2020-03-08  2.0  2020-03     0.8    2.0
3 2020-03-15  2.0  2020-03     0.8    2.0
3 2020-03-22  2.0  2020-03     0.8    2.0
3 2020-03-29  2.0  2020-03     0.8    2.0

Optionally, you can do some cosmetic work to drop the column Date and move the column Weeks to your desired position.
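
Putting Part 1 (new solution) and Part 2 together, a self-contained sketch might look like the following. The sample dataframe construction, the reset_index(drop=True) call, the variable names monthly, counts, result and weekly_totals, and the final sanity check are assumptions added for illustration; the core steps follow the answer above.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Date': ['2020-02', '2020-03', '2020-02', '2020-03'],
    'Volume': [10, 8, 6, 4],
    'Sales': [4, 6, 8, 10],
})

# Keep the monthly totals so the weekly split can be checked afterwards
monthly = df.set_index(['ID', 'Date'])[['Volume', 'Sales']]

# Part 1: one row per Sunday within each month
df['Weeks'] = df['Date'].map(lambda x:
                             pd.date_range(
                                 start=pd.to_datetime(x),
                                 end=(pd.to_datetime(x) + pd.offsets.MonthEnd()),
                                 freq='W'))
# reset_index gives every weekly row its own index label (an added choice)
grouped = df.explode('Weeks').reset_index(drop=True)

# Part 2: divide by the number of weekly rows in each (ID, Date) group
counts = (grouped.groupby(['ID', 'Date'])[['Volume', 'Sales']]
                 .transform('count'))
grouped[['Volume', 'Sales']] = grouped[['Volume', 'Sales']] / counts

# Cosmetic step: drop Date and put the columns in the order of the expected output
result = grouped.drop(columns='Date')[['ID', 'Volume', 'Sales', 'Weeks']]
print(result)

# Sanity check: the weekly values add back up to the monthly totals
weekly_totals = grouped.groupby(['ID', 'Date'])[['Volume', 'Sales']].sum()
print((weekly_totals - monthly).abs().max().max() < 1e-9)
```

The check at the end compares with a small tolerance because the weekly values are floating-point numbers.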

In this review, I searched for other questions with similar titles and compared the questions and solutions.

There is another question with a similar requirement: split the monthly values equally into weekly values according to the number of weeks in the resampled month. In that question, the months are represented by the first date of each month, are in datetime format, and are used as the index of the dataframe. In this question, the months are represented as YYYY-MM, which can be of string type.

A big and critical difference is that, in that question, the last month period index 2018-05-01 with value 22644 was not actually processed. That is, the month of 2018-05 is not resampled into the weeks of May 2018, and the value 22644 is never split into weekly proportions. The accepted solution using .asfreq() does not show any entry for 2018-05 at all, while the other solution using .resample() still keeps one (un-resampled) entry for 2018-05 with the value 22644 not split into weekly proportions.

However, in our question here, the last month listed in each group still needs to be resampled into weeks, with its values split equally among the resampled weeks.

Looking at the solutions, my new solution calls neither .resample() nor .asfreq(). It just uses pd.date_range() with freq='W', together with pd.offsets.MonthEnd(), to generate the weekly dates for a month from the 'YYYY-MM' values. This is something I had not thought of when I worked on the old solution, which makes use of .resample().
