How to perform groupby on pandas dataframe without losing other columns?


Question

I have a DataFrame like the one below:

import pandas as pd

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
           'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
               'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
        'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
        'person_count': ['10','14','25','20','34','23','43','34','10','20'],
        'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'],
        'sir': ['a','a','a','a','b','b','b','b','c','c']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]

print(df)

   sport_name person_name    city person_symbol person_count       month sir
0    football      ramesh  mumbai           ram           10  2017-01-23   a
1    football      ramesh  mumbai           mum           14  2017-01-23   a
2    football      ramesh   delhi           mum           25  2017-01-23   a
3    football      ramesh   delhi           ram           20  2017-01-23   a
4    football      ramesh  mumbai           ram           34  2017-02-26   b
5    football      ramesh  mumbai           mum           23  2017-02-26   b
6    football      ramesh   delhi           mum           43  2017-02-26   b
7    football      ramesh   delhi           ram           34  2017-02-26   b
8  basketball      mahesh    pune           mah           10  2017-03-03   c
9  basketball      mahesh  nagpur           mah           20  2017-03-03   c

From this DataFrame, I want to create a DataFrame with two columns named "derived_symbol" and "person_count". To create it, I need to respect the following conditions:

  • A derived symbol must be formed for each unique city and each unique person_symbol.

  • person_count is calculated according to what the derived_symbol is.

For the above, I wrote the following code, and it works fine:

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
           'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
               'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
        'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
        'person_count': ['10','14','25','20','34','23','43','34','10','20'],
        'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'],
        'sir': ['a','a','a','a','b','b','b','b','c','c']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]

df['person_count'] = df['person_count'].astype(int)

df1=df.set_index(['sport_name','person_name','person_count','month','sir']).stack().reset_index(name='val')

df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'

df2 = df1.groupby(['derived_symbol','month','sir','person_name'])['person_count'].sum().reset_index(name='person_count')
print (df2)

Output of the above code:

                          derived_symbol       month sir  sport_name person_name  person_count
0      basketball.mahesh.TOTAL.mah_count  2017-03-03   c  basketball      mahesh            30
1   basketball.mahesh.TOTAL.nagpur_count  2017-03-03   c  basketball      mahesh            20
2     basketball.mahesh.TOTAL.pune_count  2017-03-03   c  basketball      mahesh            10
3      football.ramesh.TOTAL.delhi_count  2017-01-23   a    football      ramesh            45
4      football.ramesh.TOTAL.delhi_count  2017-02-26   b    football      ramesh            77
5        football.ramesh.TOTAL.mum_count  2017-01-23   a    football      ramesh            39
6        football.ramesh.TOTAL.mum_count  2017-02-26   b    football      ramesh            66
7     football.ramesh.TOTAL.mumbai_count  2017-01-23   a    football      ramesh            24
8     football.ramesh.TOTAL.mumbai_count  2017-02-26   b    football      ramesh            57
9        football.ramesh.TOTAL.ram_count  2017-01-23   a    football      ramesh            30
10       football.ramesh.TOTAL.ram_count  2017-02-26   b    football      ramesh            68

But I want a DataFrame with two additional columns, "city" and "person_symbol", like below:

                          derived_symbol       month sir person_name  sport_name  person_count      city person_symbol
0      basketball.mahesh.TOTAL.mah_count  2017-03-03   c      mahesh  basketball            30  NO_ENTRY           mah
1   basketball.mahesh.TOTAL.nagpur_count  2017-03-03   c      mahesh  basketball            20    nagpur      NO_ENTRY
2     basketball.mahesh.TOTAL.pune_count  2017-03-03   c      mahesh  basketball            10      pune      NO_ENTRY
3      football.ramesh.TOTAL.delhi_count  2017-01-23   a      ramesh    football            45     delhi      NO_ENTRY
4      football.ramesh.TOTAL.delhi_count  2017-02-26   b      ramesh    football            77     delhi      NO_ENTRY
5        football.ramesh.TOTAL.mum_count  2017-01-23   a      ramesh    football            39  NO_ENTRY           mum
6        football.ramesh.TOTAL.mum_count  2017-02-26   b      ramesh    football            66  NO_ENTRY           mum
7     football.ramesh.TOTAL.mumbai_count  2017-01-23   a      ramesh    football            24    mumbai      NO_ENTRY
8     football.ramesh.TOTAL.mumbai_count  2017-02-26   b      ramesh    football            57    mumbai      NO_ENTRY
9        football.ramesh.TOTAL.ram_count  2017-01-23   a      ramesh    football            30  NO_ENTRY           ram
10       football.ramesh.TOTAL.ram_count  2017-02-26   b      ramesh    football            68  NO_ENTRY           ram

The logic behind these two columns is:

  • If the current row was created for a particular city, the city column contains the city value and person_symbol contains NO_ENTRY.

  • If the current row was created for a particular symbol, the person_symbol column contains the person_symbol value and city contains NO_ENTRY.

How can I do this manipulation of the data without losing my previous behavior?

Recommended answer

You can first add the columns level_5 and val to the groupby:

df2 = df1.groupby(['derived_symbol',
                    'month','sir',
                    'person_name', 
                    'level_5',
                    'val'])['person_count'].sum().reset_index(name='person_count')
print (df2)
                          derived_symbol       month sir person_name  \
0      basketball.mahesh.TOTAL.mah_count  2017-03-03   c      mahesh   
1   basketball.mahesh.TOTAL.nagpur_count  2017-03-03   c      mahesh   
2     basketball.mahesh.TOTAL.pune_count  2017-03-03   c      mahesh   
3      football.ramesh.TOTAL.delhi_count  2017-01-23   a      ramesh   
4      football.ramesh.TOTAL.delhi_count  2017-02-26   b      ramesh   
5        football.ramesh.TOTAL.mum_count  2017-01-23   a      ramesh   
6        football.ramesh.TOTAL.mum_count  2017-02-26   b      ramesh   
7     football.ramesh.TOTAL.mumbai_count  2017-01-23   a      ramesh   
8     football.ramesh.TOTAL.mumbai_count  2017-02-26   b      ramesh   
9        football.ramesh.TOTAL.ram_count  2017-01-23   a      ramesh   
10       football.ramesh.TOTAL.ram_count  2017-02-26   b      ramesh   

          level_5     val  person_count  
0   person_symbol     mah            30  
1            city  nagpur            20  
2            city    pune            10  
3            city   delhi            45  
4            city   delhi            77  
5   person_symbol     mum            39  
6   person_symbol     mum            66  
7            city  mumbai            24  
8            city  mumbai            57  
9   person_symbol     ram            30  
10  person_symbol     ram            68 
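
As a side note on where level_5 comes from: the column axis of the frame has no name, so after stack() the new index level is unnamed and reset_index() labels it by its position. The sketch below uses a hypothetical two-row frame (not the question's data) with two index columns, so the label there is level_2 rather than level_5. The name 'source' in the second half is only an illustrative choice.

```python
import pandas as pd

# Hypothetical two-row frame, only to show how the stacked level gets named.
small = pd.DataFrame({
    'sport_name': ['football', 'football'],
    'person_count': [10, 14],
    'city': ['mumbai', 'delhi'],
    'person_symbol': ['ram', 'mum'],
})

indexed = small.set_index(['sport_name', 'person_count'])

# The column axis has no name, so the stacked level is unnamed and
# reset_index() labels it by position: level_2 here, level_5 in the question.
stacked = indexed.stack().reset_index(name='val')
print(stacked.columns.tolist())

# Naming the column axis first gives the new column a readable name instead.
named = indexed.rename_axis(columns='source').stack().reset_index(name='val')
print(named.columns.tolist())
```

If the number of columns passed to set_index changes, the positional label changes with it, so naming the axis explicitly may be the more robust option.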

Then reshape back with unstack, and convert the resulting missing values to NO_ENTRY with fillna:

df3=df2.set_index(['derived_symbol',
                   'month',
                   'sir',
                   'person_name',
                   'person_count', 
                   'level_5'])['val'] \
       .unstack() \
       .fillna('NO_ENTRY') \
       .rename_axis(None, axis=1) \
       .reset_index()



print (df3)
                          derived_symbol       month sir person_name  \
0      basketball.mahesh.TOTAL.mah_count  2017-03-03   c      mahesh   
1   basketball.mahesh.TOTAL.nagpur_count  2017-03-03   c      mahesh   
2     basketball.mahesh.TOTAL.pune_count  2017-03-03   c      mahesh   
3      football.ramesh.TOTAL.delhi_count  2017-01-23   a      ramesh   
4      football.ramesh.TOTAL.delhi_count  2017-02-26   b      ramesh   
5        football.ramesh.TOTAL.mum_count  2017-01-23   a      ramesh   
6        football.ramesh.TOTAL.mum_count  2017-02-26   b      ramesh   
7     football.ramesh.TOTAL.mumbai_count  2017-01-23   a      ramesh   
8     football.ramesh.TOTAL.mumbai_count  2017-02-26   b      ramesh   
9        football.ramesh.TOTAL.ram_count  2017-01-23   a      ramesh   
10       football.ramesh.TOTAL.ram_count  2017-02-26   b      ramesh   

    person_count      city person_symbol  
0             30  NO_ENTRY           mah  
1             20    nagpur      NO_ENTRY  
2             10      pune      NO_ENTRY  
3             45     delhi      NO_ENTRY  
4             77     delhi      NO_ENTRY  
5             39  NO_ENTRY           mum  
6             66  NO_ENTRY           mum  
7             24    mumbai      NO_ENTRY  
8             57    mumbai      NO_ENTRY  
9             30  NO_ENTRY           ram  
10            68  NO_ENTRY           ram  
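
For reference, the whole pipeline can also be sketched end to end. This is not the answer's method: it swaps set_index/stack for melt and unstack/fillna for Series.where, and it additionally keeps sport_name as a grouping key so that column survives as in the desired output. The names id_cols, long_df, source and out are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'sport_name': ['football'] * 8 + ['basketball'] * 2,
    'person_name': ['ramesh'] * 8 + ['mahesh'] * 2,
    'city': ['mumbai', 'mumbai', 'delhi', 'delhi',
             'mumbai', 'mumbai', 'delhi', 'delhi', 'pune', 'nagpur'],
    'person_symbol': ['ram', 'mum', 'mum', 'ram',
                      'ram', 'mum', 'mum', 'ram', 'mah', 'mah'],
    'person_count': ['10', '14', '25', '20', '34', '23', '43', '34', '10', '20'],
    'month': ['2017-01-23'] * 4 + ['2017-02-26'] * 4 + ['2017-03-03'] * 2,
    'sir': ['a'] * 4 + ['b'] * 4 + ['c'] * 2,
})
df['person_count'] = df['person_count'].astype(int)

# Columns that identify a group and must survive the aggregation.
id_cols = ['sport_name', 'person_name', 'month', 'sir']

# One output row per (input row, source column); 'source' records whether
# the value came from 'city' or from 'person_symbol'.
long_df = df.melt(id_vars=id_cols + ['person_count'],
                  value_vars=['city', 'person_symbol'],
                  var_name='source', value_name='val')

long_df['derived_symbol'] = (long_df['sport_name'] + '.' + long_df['person_name']
                             + '.TOTAL.' + long_df['val'] + '_count')

# Grouping by 'source' and 'val' as well keeps them in the result.
out = (long_df.groupby(['derived_symbol'] + id_cols + ['source', 'val'],
                       as_index=False)['person_count'].sum())

# Keep the value in the column it came from and NO_ENTRY in the other one.
for col in ['city', 'person_symbol']:
    out[col] = out['val'].where(out['source'] == col, 'NO_ENTRY')

out = out.drop(columns=['source', 'val'])
print(out)
```

The general idea in both versions is the same: any column that should not be lost has to be either a grouping key or an aggregated value.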
