Efficient solution for forward filling missing values in a pandas dataframe column?

Question

I need to forward fill values in a column of a dataframe within groups. I should note that the first value in a group is never missing by construction. I have the following solutions at the moment.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, 2, np.nan, np.nan]})

# desired output
a   b
1   1
1   1
2   2
2   2
2   2

Here are the three solutions that I've tried so far.

# really slow solutions
df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
df['b'] = df.groupby('a')['b'].fillna(method='ffill')

# much faster solution, but more memory intensive and ugly all around
tmp = df.drop_duplicates('a', keep='first')
df.drop('b', inplace=True, axis=1)
df = df.merge(tmp, on='a')

All three of these produce my desired output, but the first two take a really long time on my data set, and the third solution is more memory intensive and feels rather clunky. Are there any other ways to forward fill a column?

Recommended answer

You need to sort by both columns, df.sort_values(['a', 'b']).ffill(), to ensure robustness. If an np.nan is left in the first position within a group, ffill will fill it with a value from the previous group. Because sort_values places np.nan at the end by default, sorting by both a and b ensures that no group starts with np.nan, provided the group has at least one non-missing value. You can then use .loc or .reindex with the initial index to get back the original order.

This will obviously be a tad slower than the other proposals. However, I contend it will be correct where the others are not.
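The three steps described above (sort by both columns, forward fill, restore the original order) can be sketched as a small helper. The name fill_by_sorting is hypothetical and not part of pandas, and the sketch assumes the dataframe has a unique index so that reindex can restore the row order.

```python
import numpy as np
import pandas as pd


def fill_by_sorting(df, key, col):
    """Hypothetical helper: fill gaps in `col` per `key` group via a two-column sort."""
    # Sort by the group key, then by the column to fill; NaN sorts last by default,
    # so every group with at least one value starts with a non-missing entry.
    ordered = df.sort_values([key, col])
    # Forward fill along the sorted order.
    filled = ordered[col].ffill()
    # Put the filled values back in the original row order (assumes a unique index).
    return df.assign(**{col: filled.reindex(df.index)})


df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, np.nan, 2, np.nan]})
result = fill_by_sorting(df, 'a', 'b')
print(result)
```

Unlike calling ffill on the whole sorted frame, this version touches only the one column and returns a new dataframe rather than modifying df.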

Demo

Consider the dataframe df

df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, np.nan, 2, np.nan]})

print(df)

   a    b
0  1  1.0
1  1  NaN
2  2  NaN
3  2  2.0
4  2  NaN

Try sorting by a alone:

df.sort_values('a').ffill()

   a    b
0  1  1.0
1  1  1.0
2  2  1.0  # <--- this is incorrect
3  2  2.0
4  2  2.0

Instead, sort by both columns and restore the original order:

df.sort_values(['a', 'b']).ffill().loc[df.index]

   a    b
0  1  1.0
1  1  1.0
2  2  2.0
3  2  2.0
4  2  2.0
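One point worth checking before relying on this: because the rows are reordered by b, each gap receives the largest non-missing value of its group rather than the value that precedes it in the original row order. That matches a strict forward fill only when each group holds a single distinct value, as in the question's data. The sketch below compares the two behaviours; the names strict, sorted_fill, df2, strict2 and sorted2 are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, np.nan, 2, np.nan]})

# Strict forward fill inside each group: row 2 has no earlier value in
# group 2, so it stays NaN.
strict = df.groupby('a')['b'].ffill()

# Sort-based fill from the answer: row 2 receives 2.0.
sorted_fill = df.sort_values(['a', 'b'])['b'].ffill().loc[df.index]

print(pd.DataFrame({'a': df['a'], 'strict': strict, 'sorted_fill': sorted_fill}))

# With several distinct values in one group the two approaches diverge.
df2 = pd.DataFrame({'a': [1, 1, 1, 1], 'b': [1, np.nan, 2, np.nan]})
strict2 = df2.groupby('a')['b'].ffill()
sorted2 = df2.sort_values(['a', 'b'])['b'].ffill().loc[df2.index]
print(strict2.tolist())
print(sorted2.tolist())
```

If the values within a group can vary, the groupby version is the one that preserves forward-fill semantics.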

Special note
This is still incorrect if every value in a group is missing: the sort leaves that group starting with np.nan, so ffill fills it from the previous group.
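A minimal sketch of that failure, together with one possible guard. The guard is an assumption about how one might handle it, not part of the original answer: it counts the non-missing values per group and blanks out the fill for groups that had none. The names leaked, has_value and guarded are illustrative.

```python
import numpy as np
import pandas as pd

# Group 2 has no non-missing value at all.
df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, np.nan, np.nan, np.nan]})

# The sort-based fill leaks the value 1.0 from group 1 into group 2.
leaked = df.sort_values(['a', 'b'])['b'].ffill().loc[df.index]
print(leaked.tolist())

# Count non-missing values per group, broadcast back to the rows.
has_value = df.groupby('a')['b'].transform('count') > 0

# Keep the fill only for groups that had something to fill from.
guarded = leaked.where(has_value)
print(guarded.tolist())
```

The extra groupby pass adds some cost, so it is only worth including when all-missing groups can actually occur in the data.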
