Groupby sum, index vs. column results


Problem description

For the following DataFrame:

import pandas as pd

df = pd.DataFrame({'group':['a','a','b','b'], 'data':[5,10,100,30]},columns=['group', 'data'])
print(df)

  group  data
0     a     5
1     a    10
2     b   100
3     b    30

When grouping by the column, summing, and assigning the result to a new column, the result is:

df['new'] = df.groupby('group')['data'].sum() 
print(df)

  group  data  new
0     a     5  NaN
1     a    10  NaN
2     b   100  NaN
3     b    30  NaN

However, if we reset df to the original data and move the group column to the index:

df.set_index('group', inplace=True)
print(df)

       data
group      
a         5
a        10
b       100
b        30

Then grouping and summing gives:

df['new'] = df.groupby('group')['data'].sum() 
print(df)

       data  new
group           
a         5   15
a        10   15
b       100  130
b        30  130

Why does grouping by the column leave the new column unset (all NaN), while grouping by the index fills in the values?

Recommended answer

A better approach here is GroupBy.transform, which returns a Series the same size as the original DataFrame, so the assignment works correctly:

df['new'] = df.groupby('group')['data'].transform('sum')
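As a minimal sketch of why transform is the safer choice, the snippet below runs it on the question's data in both layouts: with group as a regular column, and with group moved into the index. The name df_indexed is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]})

# transform('sum') returns one value per original row, indexed like df,
# so the assignment lines up regardless of what the index looks like.
df['new'] = df.groupby('group')['data'].transform('sum')
print(df)

# Same call after moving 'group' into the index (grouping by the index name).
df_indexed = df.drop(columns='new').set_index('group')
df_indexed['new'] = df_indexed.groupby('group')['data'].transform('sum')
print(df_indexed)
```

Both frames should end up with the per-group totals repeated on every row of the group.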


This happens because assigning a Series to a new column aligns its values on index labels. If the indexes differ, you get NaNs:

print (df.groupby('group')['data'].sum())
group
a     15
b    130
Name: data, dtype: int64

The index values differ, so the result is NaN:

print (df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')

print (df.index)
RangeIndex(start=0, stop=4, step=1)
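The mismatch above can be made explicit with a small sketch. It assumes that column assignment behaves roughly like reindexing the Series to df.index, and it also shows Series.map as one alternative way to look up each row's group total. The names sums, aligned, and mapped are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]})
sums = df.groupby('group')['data'].sum()  # indexed by the labels 'a' and 'b'

# Assignment aligns on labels, roughly like reindexing sums to df.index.
# None of the row labels 0..3 exist in sums.index, so every value is missing.
aligned = sums.reindex(df.index)
print(aligned)

# Mapping the group column through sums looks up each row's group label instead.
mapped = df['group'].map(sums)
print(mapped)
```

The map variant is not from the original answer; transform stays the more direct route when the aggregation and the assignment happen in one step.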


df.set_index('group', inplace=True)

print (df.groupby('group')['data'].sum())
group
a     15
b    130
Name: data, dtype: int64

Here the indexes can align because the labels match:

print (df.groupby('group')['data'].sum().index)
Index(['a', 'b'], dtype='object', name='group')

print (df.index)
Index(['a', 'a', 'b', 'b'], dtype='object', name='group')
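To see the matching case end to end, here is a short sketch on the same data with group as the index. The variable sums is an illustrative name, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {'group': ['a', 'a', 'b', 'b'], 'data': [5, 10, 100, 30]}
).set_index('group')
sums = df.groupby('group')['data'].sum()  # unique labels 'a' and 'b'

# Every label in df.index exists in sums.index, so alignment repeats
# each group's total on all rows carrying that label.
df['new'] = sums
print(df)
```

This relies on the aggregated Series having unique labels, which is the case for a groupby sum.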
