Efficient solution for forward filling missing values in a pandas dataframe column?
Question
I need to forward fill values in a column of a dataframe within groups. I should note that the first value in a group is never missing by construction. I have the following solutions at the moment.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, 2, np.nan, np.nan]})
# desired output
a b
1 1
1 1
2 2
2 2
2 2
Here are the three solutions that I've tried so far.
# really slow solutions (fillna(method='ffill') is deprecated in recent pandas; .ffill() is the current spelling)
df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
df['b'] = df.groupby('a')['b'].fillna(method='ffill')
# much faster solution, but more memory intensive and ugly all around
tmp = df.drop_duplicates('a', keep='first')
df.drop('b', inplace=True, axis=1)
df = df.merge(tmp, on='a')
All three of these produce my desired output, but the first two take a really long time on my data set, and the third solution is more memory intensive and feels rather clunky. Are there any other ways to forward fill a column?
Recommended answer
You need to sort by both columns, df.sort_values(['a', 'b']).ffill(), to ensure robustness. If an np.nan is left in the first position within a group, ffill will fill it with a value from the prior group. Because np.nan will be placed at the end of any sort, sorting by both a and b ensures that you will not have np.nan at the front of any group. You can then use .loc or .reindex with the initial index to get back your original order.
This will obviously be a tad slower than the other proposals... However, I contend it will be correct where the others are not.
Demo
Consider the dataframe df
df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, np.nan, 2, np.nan]})
print(df)
a b
0 1 1.0
1 1 NaN
2 2 NaN
3 2 2.0
4 2 NaN
Try
df.sort_values('a').ffill()
a b
0 1 1.0
1 1 1.0
2 2 1.0 # <--- this is incorrect
3 2 2.0
4 2 2.0
Instead
df.sort_values(['a', 'b']).ffill().loc[df.index]
a b
0 1 1.0
1 1 1.0
2 2 2.0
3 2 2.0
4 2 2.0
Special note
This is still incorrect if an entire group has missing values
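The caveat above can be reproduced with a small example. The sketch below is only an illustration, not part of the original answer: the four-row frame is made up for the purpose, and the per-group alternative uses pandas' groupby(...).ffill(), which is a strict forward fill (a missing value at the start of a group stays missing). No claim is made here about its speed on large data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 2 has no observed value at all.
df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1.0, np.nan, np.nan, np.nan]})

# Sort-then-ffill: group 2 inherits 1.0 from group 1.
leaked = df.sort_values(['a', 'b']).ffill().loc[df.index]

# Per-group forward fill: values never cross a group boundary.
per_group = df.groupby('a')['b'].ffill()

print(leaked['b'].tolist())   # [1.0, 1.0, 1.0, 1.0]
print(per_group.tolist())     # [1.0, 1.0, nan, nan]
```

Whether an all-missing group should stay missing or receive some default is a decision for the data at hand; the second result simply leaves it untouched.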