Filling last known data with pandas
Question
I work with dataframes that have this kind of structure:
[timestamp][id][A][B][C][D]
For a number of reasons, not all of the data is retrieved, and I end up with something like this:
2015-05-05 15:00:00 2 NaN NaN NaN 2680
2015-05-05 15:00:00 3 0989 0020 -0011 2680
2015-05-05 15:00:00 4 1022 0060 -0076 2600
2015-05-05 15:00:00 5 NaN NaN NaN 2623
In this case, using fillna(method="ffill") will take the 'A' value from id=4 to fill the 'A' value of id=5.
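The behaviour described above can be reproduced on the four example rows. This is a minimal sketch: the frame is typed in by hand with the leading zeros dropped, the variable name `naive` is only illustrative, and `DataFrame.ffill()` is used as the equivalent of `fillna(method="ffill")`.

```python
import numpy as np
import pandas as pd

# The four rows from the example above, with the leading zeros dropped.
df = pd.DataFrame(
    [["2015-05-05 15:00:00", 2, np.nan, np.nan, np.nan, 2680],
     ["2015-05-05 15:00:00", 3, 989, 20, -11, 2680],
     ["2015-05-05 15:00:00", 4, 1022, 60, -76, 2600],
     ["2015-05-05 15:00:00", 5, np.nan, np.nan, np.nan, 2623]],
    columns=["timestamp", "id", "A", "B", "C", "D"],
)

# A plain forward fill walks down the rows and ignores the id column.
naive = df.ffill()

# The id=5 row now carries the A, B and C values of id=4, the row directly above it.
print(naive.loc[3, ["A", "B", "C"]].tolist())

# The id=2 row is first, so there is nothing above it to copy from and it stays NaN.
print(naive.loc[0, ["A", "B", "C"]].isna().all())
```

The fill is purely positional, which is why the id has to be taken into account separately.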
What I want to accomplish, however, is to fill the 'A' value of id=5 from its previously known value somewhere along the (700000) CSV entries, and to fill all NaN values in all columns this way.
2014-07-24 17:49:00 5 1046.0 -3.0 -239.0 2800.0
...
2015-05-05 15:00:00 2 NaN NaN NaN 2680
2015-05-05 15:00:00 3 0989 0020 -0011 2680
2015-05-05 15:00:00 4 1022 0060 -0076 2600
2015-05-05 15:00:00 5 NaN NaN NaN 2623
becomes
2015-05-05 15:00:00 2 NaN NaN NaN 2680
2015-05-05 15:00:00 3 0989 0020 -0011 2680
2015-05-05 15:00:00 4 1022 0060 -0076 2600
2015-05-05 15:00:00 5 1046 -3.0 -239.0 2623
with values coming from the last known data for id=5, which is from 2014-07-24 17:49:00,
instead of
2015-05-05 15:00:00 2 NaN NaN NaN 2680
2015-05-05 15:00:00 3 0989 0020 -0011 2680
2015-05-05 15:00:00 4 1022 0060 -0076 2600
2015-05-05 15:00:00 5 1022 0060 -0076 2623
which is what I get when I run fillna.
My second idea was to do a groupby on 'id' and then a fillna(), but that was way beyond my skill as a pandas user, so I was wondering whether there are any better solutions out there.
Recommended answer
First I group by the id, and then within each group I fill the NaNs with the previous value.
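import numpy as np
import pandas as pd

df = pd.DataFrame([['2014-07-24 17:49:00', 5, 1046.0, -3, -239, 2800],
                   ['2015-05-05 15:00:00', 2, np.nan, np.nan, np.nan, 2680],
                   ['2015-05-05 15:00:00', 3, 989, 20, -11, 2680],
                   ['2015-05-05 15:00:00', 4, 1022, 60, -76, 2600],
                   ['2015-05-05 15:00:00', 5, np.nan, np.nan, np.nan, 2623]],
                  columns='timestamp id A B C D'.split())
# Forward-fill within each id group; ffill() replaces the deprecated fillna(method='ffill').
# group_keys=False keeps the original row index in the result.
df.groupby('id', group_keys=False).apply(lambda group: group.ffill())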
Result:
timestamp id A B C D
0 2014-07-24 17:49:00 5 1046 -3 -239 2800
1 2015-05-05 15:00:00 2 NaN NaN NaN 2680
2 2015-05-05 15:00:00 3 989 20 -11 2680
3 2015-05-05 15:00:00 4 1022 60 -76 2600
4 2015-05-05 15:00:00 5 1046 -3 -239 2623
Edit: This was a suggested edit by JoeCondron, and it is probably a better answer than mine. I'm adding it here until he posts it as an answer, because I don't want it to be lost:
df.groupby('id', as_index=False).fillna(method='ffill')
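On recent pandas releases the same idea can be written with `GroupBy.ffill()`. The sketch below rests on a few assumptions that are not in the original answer: that "last known" means earlier in time, so the rows are sorted by timestamp first; that the columns to fill are A to D (the name `value_cols` is only illustrative); and that the filled columns are assigned back, because the grouped result may not include the `id` column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["2014-07-24 17:49:00", 5, 1046.0, -3, -239, 2800],
     ["2015-05-05 15:00:00", 2, np.nan, np.nan, np.nan, 2680],
     ["2015-05-05 15:00:00", 3, 989, 20, -11, 2680],
     ["2015-05-05 15:00:00", 4, 1022, 60, -76, 2600],
     ["2015-05-05 15:00:00", 5, np.nan, np.nan, np.nan, 2623]],
    columns=["timestamp", "id", "A", "B", "C", "D"],
)

# Assumption: "last known" means earlier in time, so sort chronologically first.
# mergesort is stable, so rows sharing a timestamp keep their original order.
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.sort_values("timestamp", kind="mergesort")

# Forward-fill each value column within its own id group. The result keeps
# the original row index, so it can be assigned straight back.
value_cols = ["A", "B", "C", "D"]
df[value_cols] = df.groupby("id")[value_cols].ffill()

print(df)
```

The id=5 row at 2015-05-05 picks up its values from the 2014-07-24 row, while id=2 stays NaN because it has no earlier entry in this sample.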