GroupBy on multiple columns and apply moving function

Question

Suppose I have this dataset:

Country_id  Company_id  Date    Company_value
1   1   01/01/2018  1
1   1   02/01/2018  0
1   1   03/01/2018  2
1   1   04/01/2018  NA
1   2   01/01/2018  1
1   2   02/01/2018  2
1   2   03/01/2018  NA
1   2   04/01/2018  NA
2   1   01/01/2018  3
2   1   02/01/2018  0
2   1   03/01/2018  2
2   1   04/01/2018  NA
2   2   01/01/2018  1
2   2   02/01/2018  2
2   2   03/01/2018  NA
2   2   04/01/2018  NA

I want to apply a moving function (e.g. a moving average) to retrieve one aggregated value for each date and country.

For example, for a moving average (with window = 2 and min_periods = 1, NAs not counted), I would like to get the following:

Country_id  Date    Companies_value
1   01/01/2018  1
1   02/01/2018  1
1   03/01/2018  1.33
1   04/01/2018  2
2   01/01/2018  2
2   02/01/2018  1.5
2   03/01/2018  1.33
2   04/01/2018  2

To make this easier to follow, the values are calculated in the following way:

Country_id  Date    Companies_value
1   01/01/2018  (1+1)/2
1   02/01/2018  (0+1+2+1)/4
1   03/01/2018  (2+0+2)/3
1   04/01/2018  (2)/1
2   01/01/2018  (3+1)/2
2   02/01/2018  (0+3+2+1)/4
2   03/01/2018  (2+0+2)/3
2   04/01/2018  (2)/1

How can I do this with pandas?

To give a brief example in words: for country 1 at the date 03/01/2018, I want the average of all the companies' values for this country over the dates 02/01/2018 and 03/01/2018 (for window size 2).

Hence this is the calculation I want for country 1 at the date 03/01/2018:

( Company_value(Company_1, 03/01/2018) + Company_value(Company_1, 02/01/2018) 
+ Company_value(Company_2, 03/01/2018) + Company_value(Company_2, 02/01/2018) ) / 4 =

= ( 2 + 0 + NA + 2) / 4 

= ( 2 + 0 + 2) / 3 # NAs not counted in

= 1.33

Analogously, I want the same to be done for all the dates of each country.

As I said, I would like to do the same with my own moving functions beyond the moving average of pandas, so a solution that works for any custom function would be best.

Recommended answer

Updated with additional information

Data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date':['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01']*4,
              'country_id':[1]*8+[2]*8,
              'company_id':[1]*4+[2]*4+[1]*4+[2]*4,
              'value':[1, 0, 2, np.nan, 1, 2, np.nan, np.nan, 3, 0, 2, np.nan, 1, 2, np.nan, np.nan]})

Rolling sum of value within each country_id:

df['rolling_sum'] = df.groupby('country_id').apply(lambda x: x.value.rolling(window=2, min_periods=1).sum()).reset_index(drop=True)

Rolling count of non-NA values within each country_id:

df['sum_records'] = df.groupby('country_id').apply(lambda x: x.value.rolling(window=2, min_periods=1).count()).reset_index(drop=True)

Now group by country_id and date, sum the rolling sums, and divide by the sum of the rolling counts:

summarized_df = df.groupby(['country_id', 'date']).apply(lambda x: x.rolling_sum.sum()/x.sum_records.sum()).reset_index()

country_id  date      
1           2018-01-01    1.000000
            2018-02-01    1.000000
            2018-03-01    1.333333
            2018-04-01    2.000000
2           2018-01-01    2.000000
            2018-02-01    1.500000
            2018-03-01    1.333333
            2018-04-01    2.000000
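One caveat: rolling over all rows of a country in row order lets a window cross from one company's last date into the next company's first date. With this data that does no harm, because the last value of each company is NaN. A more defensive variant, sketched below under that assumption, rolls within each (country_id, company_id) pair and then pools per date. The column name rolling_count and the variable names are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01'] * 4,
    'country_id': [1] * 8 + [2] * 8,
    'company_id': [1] * 4 + [2] * 4 + [1] * 4 + [2] * 4,
    'value': [1, 0, 2, np.nan, 1, 2, np.nan, np.nan,
              3, 0, 2, np.nan, 1, 2, np.nan, np.nan],
})

# Sort so that each company's rows are in date order before rolling.
df = df.sort_values(['country_id', 'company_id', 'date']).reset_index(drop=True)

# Roll within each (country, company) pair so a window never mixes companies.
per_company = df.groupby(['country_id', 'company_id'])['value']
df['rolling_sum'] = per_company.transform(
    lambda s: s.rolling(window=2, min_periods=1).sum())
df['rolling_count'] = per_company.transform(
    lambda s: s.rolling(window=2, min_periods=1).count())

# Pool the companies of each country per date: total sum / total count.
totals = df.groupby(['country_id', 'date'])[['rolling_sum', 'rolling_count']].sum()
country_mean = (totals['rolling_sum'] / totals['rolling_count']).rename('companies_value')

print(country_mean)
```

For the sample data this gives the same eight values as the table above; the difference only shows up when a company's last value in the frame is not NaN.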

Let's look at this in further detail. Since we are grouping by country_id, we'll subset out a single country to practice this methodology on, say country_id == 1:

df2 = df[df['country_id'] == 1]

         date  country_id  company_id  value
0  2018-01-01           1           1    1.0
1  2018-02-01           1           1    0.0
2  2018-03-01           1           1    2.0
3  2018-04-01           1           1    NaN
4  2018-01-01           1           2    1.0
5  2018-02-01           1           2    2.0
6  2018-03-01           1           2    NaN
7  2018-04-01           1           2    NaN

If we want the rolling averages for this subset, we can just do:

df2.value.rolling(window=2, min_periods=1).mean()
0    1.0
1    0.5
2    1.0
3    2.0
4    1.0
5    1.5
6    2.0
7    NaN

Here we can see how the values from our country_id == 1 subset relate to the rolling averages. Note that the window for row 4 reaches back to row 3, which belongs to the other company; this is harmless here only because row 3 is NaN.

index  value  rolling mean
0      1.0    (1)/1 = 1
1      0.0    (0 + 1)/2 = 0.5
2      2.0    (2 + 0)/2 = 1
3      NaN    (NaN + 2)/1 = 2
4      1.0    (1 + NaN)/1 = 1
5      2.0    (2 + 1)/2 = 1.5
6      NaN    (NaN + 2)/1 = 2
7      NaN    (NaN + NaN)/0 = NaN

This is what we compute within each country_id.

If we instead grouped first by country_id and then by date, a single group would look like:

df3 = df[(df['country_id'] == 1) & (df['date'] == '2018-03-01')]

df3.value
2    2.0
6    NaN

df3.value.rolling(window=2, min_periods=1).mean()
2    2.0
6    2.0

index  value  rolling mean
2      2.0    (2)/1 = 2
6      NaN    (NaN + 2)/1 = 2

The problem here is that you want the rolling windows computed within each country_id first, without grouping by date. Only after the rolling step do you combine the values that share a date. If we took the rolling averages and then averaged those, the result would be incorrect.

So let's go back to the original rolling averages we created for country_id == 1, and look at the dates:

date        value  rolling mean
2018-01-01  1.0    (1)/1 = 1
2018-02-01  0.0    (0 + 1)/2 = 0.5
2018-03-01  2.0    (2 + 0)/2 = 1
2018-04-01  NaN    (NaN + 2)/1 = 2
2018-01-01  1.0    (1 + NaN)/1 = 1
2018-02-01  2.0    (2 + 1)/2 = 1.5
2018-03-01  NaN    (NaN + 2)/1 = 2
2018-04-01  NaN    (NaN + NaN)/0 = NaN

Now the tricky part is that at this point we can't just average these rolling averages together. For example, the rolling averages for 2018-03-01 are 1 and 2, which sum to 3; dividing that by 2 would give 1.5 rather than the expected 1.33.
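The discrepancy can be checked directly on the window for country 1 at 2018-03-01. The sketch below is only an illustration with NumPy; the variable names are made up for this example and are not part of the answer's code.

```python
import numpy as np

# Window values for country 1 at 2018-03-01 (window = 2), one array per company.
company_1 = np.array([0.0, 2.0])      # 2018-02-01, 2018-03-01
company_2 = np.array([2.0, np.nan])   # 2018-02-01, 2018-03-01 (missing)

# Averaging the per-company rolling means weights each company equally.
mean_of_means = np.mean([np.nanmean(company_1), np.nanmean(company_2)])

# Pooling sums every non-NA value and divides by the number of non-NA values.
pooled = np.concatenate([company_1, company_2])
pooled_mean = np.nansum(pooled) / np.count_nonzero(~np.isnan(pooled))

print(mean_of_means)          # 1.5
print(round(pooled_mean, 2))  # 1.33
```

The two results differ because company 2 contributes only one non-NA value to the window, yet the mean of means gives it the same weight as company 1.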

Instead, we first have to sum the rolling sums for each date, and then divide by the summed count of non-NA records.
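The sum-and-count trick works for the mean but does not carry over to arbitrary functions. For a custom moving function, one possible approach (a sketch, not the answer's method) is to pool every non-NA value of a country inside each date window and pass the pooled array to the function. The helper pooled_rolling below is hypothetical, and it assumes that min_periods counts pooled non-NA values and that each (date, company_id) pair appears once per country.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01'] * 4,
    'country_id': [1] * 8 + [2] * 8,
    'company_id': [1] * 4 + [2] * 4 + [1] * 4 + [2] * 4,
    'value': [1, 0, 2, np.nan, 1, 2, np.nan, np.nan,
              3, 0, 2, np.nan, 1, 2, np.nan, np.nan],
})


def pooled_rolling(frame, func, window=2, min_periods=1):
    """Apply func to all non-NA values of a country inside each date window."""
    records = []
    for country, grp in frame.groupby('country_id'):
        # One row per date, one column per company, in date order.
        wide = grp.pivot(index='date', columns='company_id', values='value').sort_index()
        for i, date in enumerate(wide.index):
            # Rows i-window+1 .. i, flattened across companies, NaN dropped.
            block = wide.iloc[max(0, i - window + 1): i + 1].to_numpy().ravel()
            block = block[~np.isnan(block)]
            result = func(block) if block.size >= min_periods else np.nan
            records.append((country, date, result))
    return pd.DataFrame(records, columns=['country_id', 'date', 'result'])


moving_mean = pooled_rolling(df, np.mean)
print(moving_mean)
```

Any function that accepts a 1-D array can be passed in place of np.mean, for example np.median or a user-defined function. The explicit loop favours clarity over speed, so it may be slow on large frames.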
