pandas groupby apply on multiple columns to generate a new column
Question
I would like to generate a new column in a pandas DataFrame using groupby-apply.
For example, I have a DataFrame:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4],'B':['A','B','A','B'],'C':[0,0,1,1]})
and I am trying to generate a new column 'D' with groupby-apply.
This works:
df = df.assign(D=df.groupby('B').C.apply(lambda x: x - x.mean()))
because (I think) it returns a Series with the same index as the DataFrame:
In [4]: df.groupby('B').C.apply(lambda x: x - x.mean())
Out[4]:
0   -0.5
1   -0.5
2    0.5
3    0.5
Name: C, dtype: float64
But if the new column depends on more than one existing column, I cannot assign the result directly. This does not work:
df.assign(D=df.groupby('B').apply(lambda x: x.A - x.C.mean()))
It raises:
TypeError: incompatible index of inserted column with frame index
In fact, this groupby-apply returns a Series with a MultiIndex:
In [8]: df.groupby('B').apply(lambda x: x.A - x.C.mean())
Out[8]:
B
A  0    0.5
   2    2.5
B  1    1.5
   3    3.5
Name: A, dtype: float64
I could do:
df.assign(D=df.groupby('B').apply(lambda x: x.A - x.C.mean()).reset_index(level=0,drop=True))
but it seems verbose, and I am not sure it will always work as expected.
So my questions are: (i) when does pandas groupby-apply return a Series indexed like the original DataFrame, and when does it return a MultiIndex Series? (ii) Is there a better way to assign a new column computed by groupby-apply from multiple columns?
Recommended answer
Let's use group_keys=False in the groupby:
df.assign(D=df.groupby('B', group_keys=False).apply(lambda x: x.A - x.C.mean()))
Output:
   A  B  C    D
0  1  A  0  0.5
1  2  B  0  1.5
2  3  A  1  2.5
3  4  B  1  3.5