Filling last known data with pandas

Question

I work with dataframes that have this kind of structure:

[timestamp][id][A][B][C][D]

For a number of reasons, not all of the data is retrieved, and I end up with something like:

2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     NaN     NaN     NaN    2623 

In this case, using fillna(method="ffill") will take the 'A' value from id=4 to fill the 'A' value of id=5.
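A minimal sketch of that behaviour, assuming a cut-down frame that keeps only column 'A' (the values are taken from the rows in this question; the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame mirroring three of the rows in the question.
df = pd.DataFrame(
    [["2014-07-24 17:49:00", 5, 1046.0],
     ["2015-05-05 15:00:00", 4, 1022.0],
     ["2015-05-05 15:00:00", 5, np.nan]],
    columns=["timestamp", "id", "A"],
)

# A plain forward fill copies from the row directly above, ignoring id.
plain = df["A"].ffill()
print(plain.iloc[-1])  # 1022.0, the value that belongs to id=4
```

The fill has no notion of which id a row belongs to, so the gap for id=5 is filled from whichever row happens to precede it.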

What I want to accomplish, however, is to fill the 'A' value of id=5 from its previously known value somewhere among the 700000 CSV entries (and to fill all NaN values in all columns this way), so that

2014-07-24 17:49:00   5   1046.0   -3.0   -239.0   2800.0
...
2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     NaN     NaN     NaN    2623 

becomes

2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     1046     -3.0     -239.0    2623 

with the values coming from the last known data for id=5, which is from 2014-07-24 17:49:00,

instead of

2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     1022     0060     -0076    2623 

which is what I get when I call fillna.

My second idea was to do a groupby on 'id' and then a fillna(), but that was well beyond my skill as a pandas user, so I was wondering whether there are any better solutions out there.

Recommended answer

First I group by the id, and then, within each group, I fill the NaNs with the previous value.

import numpy as np
import pandas as pd

# Sample rows from the question, plus the earlier known reading for id=5.
df = pd.DataFrame([['2014-07-24 17:49:00', 5, 1046.0, -3, -239, 2800],
                   ['2015-05-05 15:00:00', 2, np.nan, np.nan, np.nan, 2680],
                   ['2015-05-05 15:00:00', 3,  989, 20, -11, 2680],
                   ['2015-05-05 15:00:00', 4, 1022, 60, -76, 2600],
                   ['2015-05-05 15:00:00', 5, np.nan, np.nan, np.nan, 2623]],
                  columns='timestamp id A B C D'.split())

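# GroupBy.ffill() replaces fillna(method='ffill'), which recent pandas deprecates.
value_cols = ['A', 'B', 'C', 'D']
df[value_cols] = df.groupby('id')[value_cols].ffill()
print(df)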

Result:

             timestamp  id     A   B    C     D
0  2014-07-24 17:49:00   5  1046  -3 -239  2800
1  2015-05-05 15:00:00   2   NaN NaN  NaN  2680
2  2015-05-05 15:00:00   3   989  20  -11  2680
3  2015-05-05 15:00:00   4  1022  60  -76  2600
4  2015-05-05 15:00:00   5  1046  -3 -239  2623
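A per-id forward fill takes the "previous" value from row order, so it only yields the last known reading if the rows of each id are in chronological order. The following is a sketch under the assumption that the file may not be sorted by time; the cut-down frame and its row order are made up for illustration:

```python
import numpy as np
import pandas as pd

# Rows deliberately out of chronological order (hypothetical ordering).
df = pd.DataFrame(
    [["2015-05-05 15:00:00", 5, np.nan, 2623],
     ["2015-05-05 15:00:00", 4, 1022.0, 2600],
     ["2014-07-24 17:49:00", 5, 1046.0, 2800]],
    columns=["timestamp", "id", "A", "D"],
)
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Sort by time (a stable sort keeps tied rows in file order), fill per id,
# then assign back. Assignment aligns on the index, so df keeps its row order.
ordered = df.sort_values("timestamp", kind="stable")
df["A"] = ordered.groupby("id")["A"].ffill()
print(df)
```

If the data is already sorted by timestamp, the sort step can be skipped.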

Edit: This was a suggested edit by JoeCondron, and it is probably a better answer than mine. I am adding it here until he posts it as an answer, because I don't want it to be lost:

# ffill() supersedes fillna(method='ffill'); the result may omit the 'id' column.
df.groupby('id', as_index=False).ffill()
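A grouped forward fill can only use values that appear earlier within the same id, so an id with no earlier reading keeps its NaNs, as id=2 does in the result above. A small sketch for counting the gaps that remain, assuming a subset of the sample rows and the illustrative names `value_cols`, `filled` and `remaining`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["2014-07-24 17:49:00", 5, 1046.0, -3, -239, 2800],
     ["2015-05-05 15:00:00", 2, np.nan, np.nan, np.nan, 2680],
     ["2015-05-05 15:00:00", 5, np.nan, np.nan, np.nan, 2623]],
    columns="timestamp id A B C D".split(),
)

value_cols = ["A", "B", "C", "D"]
filled = df.groupby("id")[value_cols].ffill()

# Count the gaps that survive the per-id fill (id=2 has no earlier reading).
remaining = filled.isna().sum()
print(remaining)
```

Any non-zero count points at ids whose first rows are missing, which would need a different strategy (for example a default value) if they have to be filled at all.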
