Backfilling columns by groups in Pandas
Question
I have a CSV like this:
A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,
In rows 1 and 4 the C value is missing (NaN). I want to take their values from rows 2 and 5 respectively (the first following occurrence of the same A, B values).
If no matching row is found, just put 0 (as in the last line). Expected output:
A,B,C,D
1,2,30,
1,2,30,100
1,2,40,100
4,5,60,
4,5,60,200
4,5,70,200
8,9,0,
Using fillna, I found bfill ("use NEXT valid observation to fill gap"), but the NEXT observation has to be chosen logically, by looking at the A and B column values, and not simply taken from the next value in column C.
Recommended answer
You'll have to call df.groupby on A and B first and then apply the bfill function:
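# reset_index(drop=True) realigns by position, so this assumes rows are already sorted by (A, B) with a default 0..n-1 index
In [501]: df.C = df.groupby(['A', 'B']).apply(lambda x: x.C.bfill()).reset_index(drop=True).fillna(0).astype(int)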
In [502]: df
Out[502]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
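The apply-then-reset_index pattern lines results up by position, which only matches the original rows when each group is contiguous and the groups appear in sorted order. As a minimal sketch of an order-independent variant, the snippet below uses `transform`, which returns a result indexed like its input. The interleaved four-row frame is hypothetical data made up for illustration, not the CSV from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the two (A, B) groups are interleaved, not contiguous
df = pd.DataFrame({
    'A': [4, 1, 4, 1],
    'B': [5, 2, 5, 2],
    'C': [np.nan, np.nan, 60, 30],
})

# transform keeps the original index, so no reset_index is needed
df['C'] = (
    df.groupby(['A', 'B'])['C']
      .transform(lambda s: s.bfill())  # backfill within each group only
      .fillna(0)                       # groups with no valid value get 0
      .astype(int)
)

print(df)
```

Because the result is aligned on the index rather than on row position, each missing C picks up the value from its own group even when the rows are shuffled.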
You can also group and then call dfGroupBy.bfill directly (I think this would be faster):
In [508]: df.C = df.groupby(['A', 'B']).C.bfill().fillna(0).astype(int); df
Out[508]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
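For readers who want to reproduce the session above from scratch, here is a small self-contained sketch. It loads the CSV from the question out of an in-memory string (the `csv_text` variable and the use of `io.StringIO` are conveniences assumed here, not part of the original answer) and applies the same grouped backfill.

```python
import io
import pandas as pd

# The CSV from the question, held in a string for convenience
csv_text = """A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Backfill C within each (A, B) group; a group with no valid value stays NaN
filled = df.groupby(['A', 'B'])['C'].bfill()

# Replace what is still missing with 0 and restore an integer dtype
df['C'] = filled.fillna(0).astype(int)

print(df)
```

The grouped `bfill` returns a Series with the same index as `df`, so the assignment lines up row by row without any index reset.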
If you wish to get rid of the NaNs in D, you could do:
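# Assign the result back; calling fillna(..., inplace=True) on df.D may not update df in recent pandas versions
df['D'] = df['D'].fillna('')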
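Filling D with an empty string mixes strings and numbers, so the column becomes object dtype. If the aim is only to have blank cells when the result is written back to CSV, one possible alternative is sketched below. It assumes that intent, which the question does not state explicitly, and it uses the nullable `Int64` dtype together with the default `na_rep` of `to_csv`; neither appears in the original answer.

```python
import io
import pandas as pd

# The CSV from the question, held in a string for convenience
csv_text = """A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,
"""

df = pd.read_csv(io.StringIO(csv_text))
df['C'] = df.groupby(['A', 'B'])['C'].bfill().fillna(0).astype(int)

# Nullable integer dtype: missing entries become <NA>, the column stays numeric
df['D'] = df['D'].astype('Int64')

# to_csv writes missing values as na_rep, which defaults to an empty string
out = df.to_csv(index=False)
print(out)
```

This keeps D usable for arithmetic inside pandas while the written file shows empty cells for the missing entries.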