How to get multiple conditional operations after a Pandas groupby?
Consider the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': [12, 10, -2, -4, -2, 5, 8, 7],
                   'C': [-5, 5, -20, 0, 1, 5, 4, -4]})
df
Out[12]:
     A   B   C
0  foo  12  -5
1  bar  10   5
2  foo  -2 -20
3  bar  -4   0
4  foo  -2   1
5  bar   5   5
6  foo   8   4
7  foo   7  -4
Here I need to compute, for each group in A, the sum of elements in B conditional on C being non-negative (i.e. being >=0, a condition based on another column). And vice-versa for C.
However, my code below fails.
df.groupby('A').agg({'B': lambda x: x[x.C>0].sum(),
'C': lambda x: x[x.B>0].sum()})
AttributeError: 'Series' object has no attribute 'B'
So it seems apply would be preferred (because apply sees the whole dataframe, I think), but unfortunately I cannot use a dictionary with apply. So I am stuck. Any ideas?
One not-so-pretty, not-so-efficient solution would be to create these conditional variables before running the groupby, but I am sure this solution does not use the full potential of Pandas.
So, for instance, the expected output for the group bar and column B would be
+10 (indeed C equals 5 and is >=0)
-4 (indeed C equals 0 and is >=0)
+5 = 11
Another example, for group foo and column B:
NaN (indeed C equals -5, so I don't want to consider the value 12 in B)
+ NaN (indeed C = -20)
-2 (indeed C = 1, so it's positive)
+ 8
+ NaN = 6
Note that I use NaNs instead of zeros because a function other than a sum (a median, for example) would give wrong results if we were to put zeros.
In other words, this is a simple conditional sum where the condition is based on another column. Thanks!
I think you can use:
# x is a single column of one group, so look up the other column in df via x.index
print(df.groupby('A').agg({'B': lambda x: x[df.loc[x.index, 'C'] >= 0].sum(),
                           'C': lambda x: x[df.loc[x.index, 'B'] >= 0].sum()}))
      B   C
A
bar  11  10
foo   6  -5
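To see why the lookup through x.index is needed, here is a minimal sketch that records what agg hands to a per-column function. The names spy and calls are made up for illustration. Assuming a reasonably recent pandas, each call receives a Series holding one column of one group, and its index still carries the original row labels, which is what lets df.loc find the matching values of the other column.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': [12, 10, -2, -4, -2, 5, 8, 7],
                   'C': [-5, 5, -20, 0, 1, 5, 4, -4]})

calls = []  # hypothetical log of what agg passes in


def spy(x):
    # x is column B of one group: record its type, its row labels,
    # and the C values found in df under those same labels.
    calls.append((type(x).__name__,
                  x.index.tolist(),
                  df.loc[x.index, 'C'].tolist()))
    return x.sum()


df.groupby('A').agg({'B': spy})

for call in calls:
    print(call)
```

Since x is only a Series, x.C cannot work, which matches the AttributeError in the question.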
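Named functions make the same idea easier to read:
def f(x):
    # x holds B for one group; keep rows where the matching C is non-negative
    c = df.loc[x.index, 'C']
    return x[c >= 0].sum()
def f1(x):
    # x holds C for one group; keep rows where the matching B is non-negative
    b = df.loc[x.index, 'B']
    return x[b >= 0].sum()
print(df.groupby('A').agg({'B': f, 'C': f1}))
      B   C
A
bar  11  10
foo   6  -5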
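The two functions differ only in which column they look up, so they could be generated by a small factory. The sketch below is only an illustration: cond_agg is a hypothetical helper, not part of pandas, and it assumes the condition is always "other column >= 0". Here column B holds the aggregate of B over rows where C >= 0, as the question asks, and the func argument lets the same helper compute a median.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': [12, 10, -2, -4, -2, 5, 8, 7],
                   'C': [-5, 5, -20, 0, 1, 5, 4, -4]})


def cond_agg(frame, cond_col, func='sum'):
    """Hypothetical helper: build an aggregator that keeps only the rows
    where frame[cond_col] >= 0 and then calls the Series method `func`."""
    def _agg(x):
        keep = frame.loc[x.index, cond_col] >= 0  # boolean mask aligned with x
        return getattr(x[keep], func)()
    return _agg


sums = df.groupby('A').agg({'B': cond_agg(df, 'C'),
                            'C': cond_agg(df, 'B')})
medians = df.groupby('A').agg({'B': cond_agg(df, 'C', 'median'),
                               'C': cond_agg(df, 'B', 'median')})
print(sums)
print(medians)
```

Because rows that fail the condition are dropped rather than replaced by zeros, the median is taken over the remaining values only.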
EDIT:
root's solution is very nice, but it can be better:
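def my_func(group):
    # group is the sub-DataFrame for one value of A, so both columns are available
    b = group[group.C >= 0].B.sum()
    c = group[group.B >= 0].C.sum()
    return pd.Series({'B': b, 'C': c})
result = df.groupby('A').apply(my_func)
print(result)
      B   C
A
bar  11  10
foo   6  -5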
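For comparison, the pre-masking route that the question mentions can also be written fairly compactly. This is a sketch rather than a recommendation: it uses Series.where to replace the values that fail the condition with NaN before grouping, and masked is just an illustrative name. Since sum and median both skip NaN by default, this also covers the median case raised in the question.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': [12, 10, -2, -4, -2, 5, 8, 7],
                   'C': [-5, 5, -20, 0, 1, 5, 4, -4]})

# Both masks are computed from the original df before assign runs,
# so masking B does not affect the condition used for C.
masked = df.assign(B=df['B'].where(df['C'] >= 0),
                   C=df['C'].where(df['B'] >= 0))

# NaN entries are ignored by both aggregations.
result = masked.groupby('A').agg(['sum', 'median'])
print(result)
```

The result has two-level columns, so a single value is read with, for example, result.loc['foo', ('B', 'median')].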