Use pandas to group by column and then create a new column based on a condition

Problem description

I need to reproduce with pandas what SQL does so easily:

select
    del_month
    , sum(case when off0_on1 = 1 then 1 else 0 end) as on1
    , sum(case when off0_on1 = 0 then 1 else 0 end) as off0
from a1
group by del_month
order by del_month

Here is a sample, illustrative pandas dataframe to work on:

a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2], 'off0_on1':[0,0,1,1,0,1,1,1]})

Here are my attempts to reproduce the above SQL with pandas. The first line works. The second line gives an error:

a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(sum)
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(sum(lambda x: 1 if x == 0 else 0))

This is the error from the second line:

TypeError: 'function' object is not iterable

A previous question of mine had a problem with the lambda function, which was solved. The bigger problem is how to reproduce SQL's "sum(case when)" logic on grouped data. I'm looking for a general solution, since I need to do this sort of thing often. The answers to my previous question suggested using map() inside the lambda function, but the results that gave for the "off0" column are not what I need. The "on1" column is what I want. The answer should be the same for the whole group (i.e. "del_month").

Recommended answer

Simply sum the Trues in your conditional logic expressions:

import pandas as pd

a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2], 
                   'off0_on1':[0,0,1,1,0,1,1,1]})

a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==1))    
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==0))

print(a1)    
#    del_month  off0_on1  on1  off0
# 0          1         0    2     2
# 1          1         0    2     2
# 2          1         1    2     2
# 3          1         1    2     2
# 4          2         0    3     1
# 5          2         1    3     1
# 6          2         1    3     1
# 7          2         1    3     1
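
The same "sum the Trues" idea can also be written without a lambda, which may be easier to reuse for other conditions. This is only a sketch: the names is_on and is_off are made up for illustration, and the masks are cast to int up front on the assumption that an integer result is wanted regardless of how a given pandas version treats boolean sums.

```python
import pandas as pd

a1 = pd.DataFrame({'del_month': [1, 1, 1, 1, 2, 2, 2, 2],
                   'off0_on1': [0, 0, 1, 1, 0, 1, 1, 1]})

# Each mask plays the role of one SQL "case when ... then 1 else 0 end".
# is_on / is_off are illustrative names, not part of the original answer.
is_on = a1['off0_on1'].eq(1).astype(int)
is_off = a1['off0_on1'].eq(0).astype(int)

# Group each mask by del_month and broadcast the group total back to every row.
a1['on1'] = is_on.groupby(a1['del_month']).transform('sum')
a1['off0'] = is_off.groupby(a1['del_month']).transform('sum')

print(a1)
```

Any other condition (a range check, an isin() test, a combination with & and |) can be swapped in for the masks without touching the groupby step.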

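Similarly, you can do the same in SQL if your dialect lets you sum a boolean expression directly (MySQL and SQLite do; in others, such as PostgreSQL, you need a cast to integer or the original CASE WHEN form):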

select
    del_month
    , sum(off0_on1 = 1) as on1
    , sum(off0_on1 = 0) as off0
from a1
group by del_month
order by del_month
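
One way to check the shortened query is to run it against an in-memory SQLite database from Python's standard sqlite3 module. This is a rough sketch, assuming SQLite as the dialect; the table is rebuilt by hand from the sample data in the question.

```python
import sqlite3

# Same sample data as the a1 dataframe: (del_month, off0_on1) pairs.
rows = [(1, 0), (1, 0), (1, 1), (1, 1), (2, 0), (2, 1), (2, 1), (2, 1)]

conn = sqlite3.connect(':memory:')
conn.execute('create table a1 (del_month integer, off0_on1 integer)')
conn.executemany('insert into a1 values (?, ?)', rows)

# In SQLite a comparison evaluates to 0 or 1, so it can be summed directly.
query = """
select
    del_month
    , sum(off0_on1 = 1) as on1
    , sum(off0_on1 = 0) as off0
from a1
group by del_month
order by del_month
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)
# [(1, 2, 2), (2, 3, 1)]
```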

To replicate the SQL above in pandas, don't use transform; instead, return multiple aggregates from a groupby().apply() call:

def aggfunc(x):
    data = {'on1': sum(x['off0_on1'] == 1),
            'off0': sum(x['off0_on1'] == 0)}

    return pd.Series(data)

g = a1.groupby('del_month').apply(aggfunc)

print(g)    
#            on1  off0
# del_month           
# 1            2     2
# 2            3     1
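
Since the question asks for a general pattern, here is one possible alternative to the apply() version: build one 0/1 column per condition first, then let a plain groupby().sum() do the aggregation. Treat it as a sketch under assumptions rather than the answer's own method; the flags dataframe is a hypothetical helper introduced only for this example.

```python
import pandas as pd

a1 = pd.DataFrame({'del_month': [1, 1, 1, 1, 2, 2, 2, 2],
                   'off0_on1': [0, 0, 1, 1, 0, 1, 1, 1]})

# One 0/1 column per "case when" branch; flags is an illustrative name.
flags = pd.DataFrame({
    'del_month': a1['del_month'],
    'on1': a1['off0_on1'].eq(1).astype(int),
    'off0': a1['off0_on1'].eq(0).astype(int),
})

# groupby sorts the keys by default, which mirrors "order by del_month".
g = flags.groupby('del_month').sum()

print(g)
```

Adding another conditional count is then a matter of adding one more column to flags, and g.reset_index() turns del_month back into an ordinary column if a flat table is preferred.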
