跨行对 Pandas 数据框进行分组 [英] Grouping Pandas dataframe across rows

查看:49
本文介绍了跨行对 Pandas 数据框进行分组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个这样的 csv:

client1,client2,client3,client4,client5,client6,amount,,,Comp1,,,,4.475000,,,Comp2,,,,16.305584,,,Comp3,,,,4.050000Comp2,Comp1,,Comp4,,,,21.000000,,,Comp4,,,,30.000000,Comp1,,Comp2,,,,5.137500,,,Comp3,,,,52.650000,,,Comp1,,,,2.650000Comp3,,,Comp3,,,,29.000000Comp5,,,Comp2,,,,20.809000Comp5,,,Comp2,,,,15.100000Comp5,,,Comp2,,,,52.404000

在将其读入 Pandas 数据帧 df 后,我想分两步进行聚合:

第一步:

首先,我将金额相加:

client1 client2 client3 client4 client5 client6 金额Comp1 7.125000Comp2 16.305584Comp3 56.700000Comp4 30.000000Comp1 Comp2 5.137500Comp2 Comp1 Comp4 21.000000Comp3 Comp3 29.000000Comp5 Comp2 88.313000

然后,我想按每个客户端名称进行聚合,这样如果像第 5 组一样涉及多个客户端,则 5.1375 必须在 Comp1 和 Comp2 之间平均分配.试过这样:

df.groupby(['client1','client2','client3','client4','client5','client6']).apply(lambda x: x['amount'].sum()/len(x) if x.any().nunique()>=1 else x['amount'].sum())客户端1 客户端2 客户端3 客户端4 客户端5 客户端6 00 Comp1 3.5625001 Comp2 16.3055842 Comp3 28.3500003 Comp4 30.0000004 Comp1 Comp2 5.1375005 Comp2 Comp1 Comp4 21.0000006 Comp3 Comp3 29.0000007 Comp5 Comp2 29.437667

预期输出为:

客户金额比较 1 4.475+21/3+5.1375/2+2.65 = 16.69375Comp2 16.305584+21/3+20.809/2+15.10/2+52.404/2 = 67.462084Comp3 4.05+52.65+29 = 85.7Comp4 21/3+30 = 37比较 5 20.809/2+15.10/2+52.404/2 = 44.1565

我尝试使用 sum(axis=0) 但没有用.

解决方案

我们可以在这里使用一点数学

cols = ['金额']# 将金额除以非空字段df['new'] = df['amount']/df.drop(cols,1).notnull().sum(1)#通过删除amount列,unstack并删除nans将索引设置为新的.x = df.drop(cols,1).set_index('new').unstack().dropna()#仅从数量和客户端创建数据框ndf = pd.DataFrame({'amount':x.index.droplevel(0).values,'clients':x.values})#Groupby 客户端并获取总和ndf.groupby('clients').sum()

输出:

<前>数量客户Comp1 16.360417Comp2 69.697501Comp3 85.700000Comp4 36.666667Comp5 44.156500

I've a csv like this:

client1,client2,client3,client4,client5,client6,amount
,,,Comp1,,,4.475000
,,,Comp2,,,16.305584
,,,Comp3,,,4.050000
Comp2,Comp1,,Comp4,,,21.000000
,,,Comp4,,,30.000000
,Comp1,,Comp2,,,5.137500
,,,Comp3,,,52.650000
,,,Comp1,,,2.650000
Comp3,,,Comp3,,,29.000000
Comp5,,,Comp2,,,20.809000
Comp5,,,Comp2,,,15.100000
Comp5,,,Comp2,,,52.404000

After reading it into a pandas dataframe, df, I wanted to aggregate in two steps:

Step1:

First, I sum the amount:

client1 client2 client3 client4 client5 client6  amount
                        Comp1                    7.125000
                        Comp2                    16.305584
                        Comp3                    56.700000
                        Comp4                    30.000000
         Comp1          Comp2                    5.137500
Comp2    Comp1          Comp4                    21.000000
Comp3                   Comp3                    29.000000
Comp5                   Comp2                    88.313000  

Then, I want to aggregate by each client name such that if multiple clients are involved like in group 5, then 5.1375 must be split equally between Comp1 and Comp2. Tried it this way:

df.groupby(['client1','client2','client3','client4','client5','client6']).apply(lambda x: x['amount'].sum()/len(x) if x.any().nunique()>=1 else x['amount'].sum())



client1 client2 client3 client4 client5 client6 0
0                           Comp1                   3.562500
1                           Comp2                   16.305584
2                           Comp3                   28.350000
3                           Comp4                   30.000000
4           Comp1           Comp2                   5.137500
5   Comp2   Comp1           Comp4                   21.000000
6   Comp3                   Comp3                   29.000000
7   Comp5                   Comp2                   29.437667

Expected Output is:

Client Amount 
Comp1  4.475+21/3+5.1375/2+2.65 = 16.69375
Comp2  16.305584+21/3+20.809/2+15.10/2+52.404/2 = 67.462084
Comp3  4.05+52.65+29 = 85.7
Comp4  21/3+30 = 37
Comp5  20.809/2+15.10/2+52.404/2 = 44.1565

I tried using sum(axis=0) but of no use.

解决方案

We can use a bit a maths here

cols = ['amount'] 
# Divide the amount by non null fields 
df['new'] = df['amount']/df.drop(cols,1).notnull().sum(1)

#Set the index as new by droping amount column, unstack and drop the nans.
x = df.drop(cols,1).set_index('new').unstack().dropna()

#Create dataframe just from amount and the clients
ndf = pd.DataFrame({'amount':x.index.droplevel(0).values,'clients':x.values})

#Groupby client and get the sum 
ndf.groupby('clients').sum()

Output:

          amount
clients           
Comp1    16.360417
Comp2    69.697501
Comp3    85.700000
Comp4    36.666667
Comp5    44.156500

这篇关于跨行对 Pandas 数据框进行分组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆