How to perform groupby on a pandas DataFrame without losing other columns?
Question
I have a DataFrame like the one below:
import pandas as pd

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
'person_count': ['10','14','25','20','34','23','43','34','10','20'],
'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'],
'sir': ['a','a','a','a','b','b','b','b','c','c']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]
print(df)
sport_name person_name city person_symbol person_count month sir
0 football ramesh mumbai ram 10 2017-01-23 a
1 football ramesh mumbai mum 14 2017-01-23 a
2 football ramesh delhi mum 25 2017-01-23 a
3 football ramesh delhi ram 20 2017-01-23 a
4 football ramesh mumbai ram 34 2017-02-26 b
5 football ramesh mumbai mum 23 2017-02-26 b
6 football ramesh delhi mum 43 2017-02-26 b
7 football ramesh delhi ram 34 2017-02-26 b
8 basketball mahesh pune mah 10 2017-03-03 c
9 basketball mahesh nagpur mah 20 2017-03-03 c
From this DataFrame, I want to create a new DataFrame with two columns named "derived_symbol" and "person_count". To build it, the following conditions apply:
- A derived symbol must be formed for each unique city and each unique person_symbol.
- person_count is computed per derived_symbol.
To do this, I wrote the following, and it works fine:
df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
'person_count': ['10','14','25','20','34','23','43','34','10','20'],
'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'],
'sir': ['a','a','a','a','b','b','b','b','c','c']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]
df['person_count'] = df['person_count'].astype(int)
df1=df.set_index(['sport_name','person_name','person_count','month','sir']).stack().reset_index(name='val')
df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'
df2 = df1.groupby(['derived_symbol','month','sir','sport_name','person_name'])['person_count'].sum().reset_index(name='person_count')
print (df2)
Output of the code above:
derived_symbol month sir sport_name person_name person_count
0 basketball.mahesh.TOTAL.mah_count 2017-03-03 c basketball mahesh 30
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c basketball mahesh 20
2 basketball.mahesh.TOTAL.pune_count 2017-03-03 c basketball mahesh 10
3 football.ramesh.TOTAL.delhi_count 2017-01-23 a football ramesh 45
4 football.ramesh.TOTAL.delhi_count 2017-02-26 b football ramesh 77
5 football.ramesh.TOTAL.mum_count 2017-01-23 a football ramesh 39
6 football.ramesh.TOTAL.mum_count 2017-02-26 b football ramesh 66
7 football.ramesh.TOTAL.mumbai_count 2017-01-23 a football ramesh 24
8 football.ramesh.TOTAL.mumbai_count 2017-02-26 b football ramesh 57
9 football.ramesh.TOTAL.ram_count 2017-01-23 a football ramesh 30
10 football.ramesh.TOTAL.ram_count 2017-02-26 b football ramesh 68
But I want a DataFrame with two additional columns, "city" and "person_symbol", like below:
derived_symbol month sir person_name sport_name person_count city person_symbol
0 basketball.mahesh.TOTAL.mah_count 2017-03-03 c mahesh basketball 30 NO_ENTRY mah
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c mahesh basketball 20 nagpur NO_ENTRY
2 basketball.mahesh.TOTAL.pune_count 2017-03-03 c mahesh basketball 10 pune NO_ENTRY
3 football.ramesh.TOTAL.delhi_count 2017-01-23 a ramesh football 45 delhi NO_ENTRY
4 football.ramesh.TOTAL.delhi_count 2017-02-26 b ramesh football 77 delhi NO_ENTRY
5 football.ramesh.TOTAL.mum_count 2017-01-23 a ramesh football 39 NO_ENTRY mum
6 football.ramesh.TOTAL.mum_count 2017-02-26 b ramesh football 66 NO_ENTRY mum
7 football.ramesh.TOTAL.mumbai_count 2017-01-23 a ramesh football 24 mumbai NO_ENTRY
8 football.ramesh.TOTAL.mumbai_count 2017-02-26 b ramesh football 57 mumbai NO_ENTRY
9 football.ramesh.TOTAL.ram_count 2017-01-23 a ramesh football 30 NO_ENTRY ram
10 football.ramesh.TOTAL.ram_count 2017-02-26 b ramesh football 68 NO_ENTRY ram
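The logic behind these two columns is:
- If the current row was created for a specific city, the city column contains the city value and person_symbol contains NO_ENTRY.
- If the current row was created for a specific symbol, the person_symbol column contains the person_symbol value and city contains NO_ENTRY.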
How can I manipulate the data this way without losing the existing behavior?
Recommended answer
You can first add the columns level_5 and val to the groupby:
df2 = df1.groupby(['derived_symbol',
'month','sir',
'person_name',
'level_5',
'val'])['person_count'].sum().reset_index(name='person_count')
print (df2)
derived_symbol month sir person_name \
0 basketball.mahesh.TOTAL.mah_count 2017-03-03 c mahesh
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c mahesh
2 basketball.mahesh.TOTAL.pune_count 2017-03-03 c mahesh
3 football.ramesh.TOTAL.delhi_count 2017-01-23 a ramesh
4 football.ramesh.TOTAL.delhi_count 2017-02-26 b ramesh
5 football.ramesh.TOTAL.mum_count 2017-01-23 a ramesh
6 football.ramesh.TOTAL.mum_count 2017-02-26 b ramesh
7 football.ramesh.TOTAL.mumbai_count 2017-01-23 a ramesh
8 football.ramesh.TOTAL.mumbai_count 2017-02-26 b ramesh
9 football.ramesh.TOTAL.ram_count 2017-01-23 a ramesh
10 football.ramesh.TOTAL.ram_count 2017-02-26 b ramesh
level_5 val person_count
0 person_symbol mah 30
1 city nagpur 20
2 city pune 10
3 city delhi 45
4 city delhi 77
5 person_symbol mum 39
6 person_symbol mum 66
7 city mumbai 24
8 city mumbai 57
9 person_symbol ram 30
10 person_symbol ram 68
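The name level_5 is not defined anywhere in the question's code. It appears because set_index is given five columns (index levels 0 to 4), stack() appends the former column labels as a sixth, unnamed level, and reset_index() labels an unnamed level by its position. A minimal sketch on a toy frame, assuming a reasonably recent pandas (the names demo, key and flat are made up for illustration):

```python
import pandas as pd

# Toy frame: one index column plus the two columns that get stacked.
demo = pd.DataFrame({'key': ['k1', 'k2'],
                     'city': ['pune', 'delhi'],
                     'person_symbol': ['mah', 'ram']})

# stack() moves the column labels into a new, unnamed innermost index level.
stacked = demo.set_index('key').stack()
print(stacked.index.names)

# reset_index() labels an unnamed level by its position: here level_1,
# because 'key' occupies level 0. With five index columns it is level_5.
flat = stacked.reset_index(name='val')
print(flat)
```

So level_5 records which original column (city or person_symbol) each val came from, which is exactly the information needed to rebuild the two columns afterwards.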
Then reshape back with unstack; the missing values this produces (None) are replaced with NO_ENTRY by fillna:
df3=df2.set_index(['derived_symbol',
'month',
'sir',
'person_name',
'person_count',
'level_5'])['val'] \
.unstack() \
.fillna('NO_ENTRY') \
.rename_axis(None, axis=1) \
.reset_index()
print (df3)
derived_symbol month sir person_name \
0 basketball.mahesh.TOTAL.mah_count 2017-03-03 c mahesh
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c mahesh
2 basketball.mahesh.TOTAL.pune_count 2017-03-03 c mahesh
3 football.ramesh.TOTAL.delhi_count 2017-01-23 a ramesh
4 football.ramesh.TOTAL.delhi_count 2017-02-26 b ramesh
5 football.ramesh.TOTAL.mum_count 2017-01-23 a ramesh
6 football.ramesh.TOTAL.mum_count 2017-02-26 b ramesh
7 football.ramesh.TOTAL.mumbai_count 2017-01-23 a ramesh
8 football.ramesh.TOTAL.mumbai_count 2017-02-26 b ramesh
9 football.ramesh.TOTAL.ram_count 2017-01-23 a ramesh
10 football.ramesh.TOTAL.ram_count 2017-02-26 b ramesh
person_count city person_symbol
0 30 NO_ENTRY mah
1 20 nagpur NO_ENTRY
2 10 pune NO_ENTRY
3 45 delhi NO_ENTRY
4 77 delhi NO_ENTRY
5 39 NO_ENTRY mum
6 66 NO_ENTRY mum
7 24 mumbai NO_ENTRY
8 57 mumbai NO_ENTRY
9 30 NO_ENTRY ram
10 68 NO_ENTRY ram
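For comparison, here is a sketch of a different route to the same table. It is not the method used above: it swaps set_index().stack() for melt() and swaps unstack().fillna() for Series.where(), so the helper column gets an explicit name instead of level_5 and person_count never has to be part of an index. The names long_df, source, keys and out are made up for illustration, and person_count is entered as integers directly rather than converted with astype(int).

```python
import pandas as pd

df = pd.DataFrame({
    'sport_name': ['football'] * 8 + ['basketball'] * 2,
    'person_name': ['ramesh'] * 8 + ['mahesh'] * 2,
    'city': ['mumbai', 'mumbai', 'delhi', 'delhi',
             'mumbai', 'mumbai', 'delhi', 'delhi', 'pune', 'nagpur'],
    'person_symbol': ['ram', 'mum', 'mum', 'ram',
                      'ram', 'mum', 'mum', 'ram', 'mah', 'mah'],
    'person_count': [10, 14, 25, 20, 34, 23, 43, 34, 10, 20],
    'month': ['2017-01-23'] * 4 + ['2017-02-26'] * 4 + ['2017-03-03'] * 2,
    'sir': ['a'] * 4 + ['b'] * 4 + ['c'] * 2,
})

# melt() produces one row per (original row, city/person_symbol) pair.
# 'source' (hypothetical name) holds what level_5 held in the answer above.
long_df = df.melt(
    id_vars=['sport_name', 'person_name', 'person_count', 'month', 'sir'],
    value_vars=['city', 'person_symbol'],
    var_name='source', value_name='val')
long_df['derived_symbol'] = (long_df['sport_name'] + '.' + long_df['person_name']
                             + '.TOTAL.' + long_df['val'] + '_count')

# Keep the value only in the column it came from; NO_ENTRY in the other one.
long_df['city'] = long_df['val'].where(long_df['source'] == 'city', 'NO_ENTRY')
long_df['person_symbol'] = long_df['val'].where(
    long_df['source'] == 'person_symbol', 'NO_ENTRY')

# Grouping on the extra columns keeps them in the result.
keys = ['derived_symbol', 'month', 'sir', 'person_name', 'sport_name',
        'city', 'person_symbol']
out = long_df.groupby(keys, as_index=False)['person_count'].sum()
print(out)
```

The general point is the same in both versions: any column that should survive a groupby has to be one of the grouping keys (or be aggregated explicitly), otherwise it is dropped from the result.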