Add unique groups to DF for each row including sum from other columns
Problem description
I have a DataFrame that looks like this:
ID field_1 area_1 field_2 area_2 field_3 area_3 field_4 area_4
1 scoccer 500 basketball 200 swimming 100 basketball 50
2 volleyball 100 np.nan np.nan np.nan np.nan np.nan np.nan
3 basketball 1000 football 10 np.nan np.nan np.nan np.nan
4 swimming 280 swimming 200 basketball 320 np.nan np.nan
5 volleyball 110 football 160 volleyball 30 np.nan np.nan
The original DataFrame has the same structure, but its columns run from field_1 to field_30 and from area_1 to area_30.
I'd like to add columns to the DF that group each row horizontally by the distinct values in 'field_x', with the corresponding areas summed up. The added columns should look like this:
ID group_1 area_1 group_2 area_2 group_3 area_3
1 scoccer 500 basketball 250 swimming 100
2 volleyball 100
3 basketball 1000 football 10
4 swimming 480 basketball 320
5 volleyball 140 football 160
Is there an easy way to do this?
Recommended answer
Use pd.wide_to_long to reshape the DataFrame to long format, which lets you group by ID and field and sum the areas. Then create a column label with cumcount and use pivot_table to return to the wide format.
import pandas as pd

# Reshape wide -> long, then sum the areas per (ID, field) pair.
df = (pd.wide_to_long(df, i='ID', j='num', stubnames=['field', 'area'], sep='_')
        .groupby(['ID', 'field'])['area'].sum()
        .reset_index())
# ID field area
#0 1 basketball 250.0
#1 1 scoccer 500.0
#2 1 swimming 100.0
#3 2 volleyball 100.0
#4 3 basketball 1000.0
#5 3 football 10.0
#6 4 basketball 320.0
#7 4 swimming 480.0
#8 5 football 160.0
#9 5 volleyball 140.0
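# Number each ID's groups 1, 2, 3, ... to build the wide column labels.
df['idx'] = df.groupby('ID').cumcount() + 1
df = (pd.pivot_table(df, index='ID', columns='idx', values=['field', 'area'],
                     aggfunc='first')
        .sort_index(axis=1, level=1))
# Flatten the (name, idx) MultiIndex columns into 'area_1', 'field_1', ...
df.columns = ['_'.join(map(str, tup)) for tup in df.columns]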
area_1 field_1 area_2 field_2 area_3 field_3
ID
1 250.0 basketball 500.0 scoccer 100.0 swimming
2 100.0 volleyball NaN NaN NaN NaN
3 1000.0 basketball 10.0 football NaN NaN
4 320.0 basketball 480.0 swimming NaN NaN
5 160.0 football 140.0 volleyball NaN NaN
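Note that the result above lists each ID's groups alphabetically and labels them field_N, whereas the question asks for the order of first appearance (scoccer, basketball, swimming for ID 1) with group_N labels. The following is a minimal sketch of one way to get that layout, assuming the sample data from the question. The names wide, long_df, totals, and result are illustrative only, and set_index/unstack is swapped in for pivot_table here so that the column order can be set explicitly.

```python
import numpy as np
import pandas as pd

# Sample data from the question (the "scoccer" spelling is kept as posted).
wide = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "field_1": ["scoccer", "volleyball", "basketball", "swimming", "volleyball"],
    "area_1": [500, 100, 1000, 280, 110],
    "field_2": ["basketball", np.nan, "football", "swimming", "football"],
    "area_2": [200, np.nan, 10, 200, 160],
    "field_3": ["swimming", np.nan, np.nan, "basketball", "volleyball"],
    "area_3": [100, np.nan, np.nan, 320, 30],
    "field_4": ["basketball", np.nan, np.nan, np.nan, np.nan],
    "area_4": [50, np.nan, np.nan, np.nan, np.nan],
})

# Wide -> long, then sort explicitly so that row order does not depend on
# how wide_to_long happens to arrange its output.
long_df = (pd.wide_to_long(wide, stubnames=["field", "area"],
                           i="ID", j="num", sep="_")
             .reset_index()
             .sort_values(["ID", "num"]))

# sort=False keeps groups in order of first appearance within each ID.
# Rows whose field is NaN are dropped by groupby.
totals = (long_df.groupby(["ID", "field"], sort=False)["area"].sum()
                 .reset_index()
                 .rename(columns={"field": "group"}))
totals["idx"] = totals.groupby("ID").cumcount() + 1

# Long -> wide; unstack keeps 'area' numeric and 'group' as strings.
result = totals.set_index(["ID", "idx"])[["group", "area"]].unstack("idx")

# Interleave the columns as group_1, area_1, group_2, area_2, ...
n_groups = int(totals["idx"].max())
order = [(name, i) for i in range(1, n_groups + 1) for name in ("group", "area")]
result = result[order]
result.columns = [f"{name}_{i}" for name, i in result.columns]

print(result)
```

For the real data with field_1 to field_30 the same code should apply unchanged, since wide_to_long picks up every column that matches the stub names and the number of output groups is taken from the data.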
Just for fun, you can use the undocumented pd.lreshape instead of wide_to_long.
# Apply this to the original wide DataFrame.
# Change range to (1, 31) for your real data.
pd.lreshape(df, {'area': [f'area_{i}' for i in range(1, 5)],
                 'field': [f'field_{i}' for i in range(1, 5)]})
# ID area field
#0 1 500.0 scoccer
#1 2 100.0 volleyball
#2 3 1000.0 basketball
#3 4 280.0 swimming
#4 5 110.0 volleyball
#5 1 200.0 basketball
#....
#10 4 320.0 basketball
#11 5 30.0 volleyball
#12 1 50.0 basketball
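As a rough, self-contained sketch of how the lreshape route could continue to the per-group sums, assuming the sample data from the question (the names wide, long_df, and totals are illustrative, and since lreshape is an obscure function its behavior may differ between pandas versions):

```python
import numpy as np
import pandas as pd

# Sample data from the question (the "scoccer" spelling is kept as posted).
wide = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "field_1": ["scoccer", "volleyball", "basketball", "swimming", "volleyball"],
    "area_1": [500, 100, 1000, 280, 110],
    "field_2": ["basketball", np.nan, "football", "swimming", "football"],
    "area_2": [200, np.nan, 10, 200, 160],
    "field_3": ["swimming", np.nan, np.nan, "basketball", "volleyball"],
    "area_3": [100, np.nan, np.nan, 320, 30],
    "field_4": ["basketball", np.nan, np.nan, np.nan, np.nan],
    "area_4": [50, np.nan, np.nan, np.nan, np.nan],
})

# Stack the numbered columns into single 'area' and 'field' columns.
# By default lreshape drops rows where the stacked values are NaN.
long_df = pd.lreshape(wide, {
    "area": [f"area_{i}" for i in range(1, 5)],
    "field": [f"field_{i}" for i in range(1, 5)],
})

# Sum the areas per (ID, field) pair, as in the wide_to_long version.
totals = long_df.groupby(["ID", "field"])["area"].sum()

print(totals)
```

From here, the cumcount and pivot step from the answer applies to totals.reset_index() in the same way.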