内插(或外推) pandas 数据框中的小缝隙 [英] Interpolate (or extrapolate) only small gaps in pandas dataframe
问题描述
我有一个熊猫DataFrame,其时间作为索引(1分钟频率),并包含几列数据.有时数据包含NaN.如果是这样,我只想在差距不超过5分钟时进行插值.在这种情况下,这将是最多5个连续的NaN.数据可能看起来像这样(几个测试用例,它们显示了问题):
I have a pandas DataFrame with time as index (1 min Freq) and several columns worth of data. Sometimes the data contains NaN. If so, I want to interpolate only if the gap is not longer than 5 Minutes. In this case this would be a maximum of 5 consecutive NaNs. The data may look like this (several test cases, which show the problems):
import numpy as np
import pandas as pd
from datetime import datetime
start = datetime(2014,2,21,14,50)
data = pd.DataFrame(index=[start + timedelta(minutes=1*x) for x in range(0, 8)],
data={'a': [123.5, np.NaN, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],
'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1],
'c': [123.5, 132.3, 136.3, 164.3] + [np.NaN]*4,
'd': [np.NaN]*8,
'e': [np.NaN]*7 + [2330.3],
'f': [np.NaN]*4 + [2763.0, 2142.3, 2127.3, 2330.3],
'g': [2330.3] + [np.NaN]*7,
'h': [2330.3] + [np.NaN]*6 + [2777.7]})
它看起来像这样:
In [147]: data
Out[147]:
a b c d e f g h
2014-02-21 14:50:00 123.5 433.5 123.5 NaN NaN NaN 2330.3 2330.3
2014-02-21 14:51:00 NaN 523.2 132.3 NaN NaN NaN NaN NaN
2014-02-21 14:52:00 136.3 536.3 136.3 NaN NaN NaN NaN NaN
2014-02-21 14:53:00 164.3 464.3 164.3 NaN NaN NaN NaN NaN
2014-02-21 14:54:00 213.0 413.0 NaN NaN NaN 2763.0 NaN NaN
2014-02-21 14:55:00 164.3 164.3 NaN NaN NaN 2142.3 NaN NaN
2014-02-21 14:56:00 213.0 213.0 NaN NaN NaN 2127.3 NaN NaN
2014-02-21 14:57:00 221.1 221.1 NaN NaN 2330.3 2330.3 NaN 2777.7
我知道data.interpolate()
,但是它有一些缺陷,因为它会产生此结果,这对于a-e列是好的,但是对于f-h列,它由于不同的原因而失败:
I am aware of data.interpolate()
but it has several flaws, as it produces this result, which is good for the columns a-e, but for the columns f-h it fails for different reasons::
a b c d e f g \
2014-02-21 14:50:00 123.5 433.5 123.5 NaN NaN NaN 2330.3
2014-02-21 14:51:00 129.9 523.2 132.3 NaN NaN NaN 2330.3
2014-02-21 14:52:00 136.3 536.3 136.3 NaN NaN NaN 2330.3
2014-02-21 14:53:00 164.3 464.3 164.3 NaN NaN NaN 2330.3
2014-02-21 14:54:00 213.0 413.0 164.3 NaN NaN 2763.0 2330.3
2014-02-21 14:55:00 164.3 164.3 164.3 NaN NaN 2142.3 2330.3
2014-02-21 14:56:00 213.0 213.0 164.3 NaN NaN 2127.3 2330.3
2014-02-21 14:57:00 221.1 221.1 164.3 NaN 2330.3 2330.3 2330.3
h
2014-02-21 14:50:00 2330.300000
2014-02-21 14:51:00 2394.214286
2014-02-21 14:52:00 2458.128571
2014-02-21 14:53:00 2522.042857
2014-02-21 14:54:00 2585.957143
2014-02-21 14:55:00 2649.871429
2014-02-21 14:56:00 2713.785714
2014-02-21 14:57:00 2777.700000
f)差距在开始时由价值4分钟的NaN组成,应将其替换为该值2763.0(即,在时间上向后推算)
f) The gap consists of 4 minutes worth of NaNs in the beginning, they should be replaced by that value 2763.0 (i.e. extrapolating backwards in time)
g)间隔超过5分钟,但仍然可以推断
g) The gap is longer than 5 minutes but still it gets extrapolated
h)间隙超过5分钟,但仍会插入间隙.
h) The gap is longer than 5 minutes but still the gap is interpolated.
我理解这些原因,当然我在任何地方都没有指定它不应插补超过5分钟的时间间隔.我了解interpolate
仅在时间上向前推断,但我希望它在时间上也向后推断.有什么已知的方法可以解决我的问题,而无需重新发明轮子吗?
I understand those reasons, of course I nowhere specified that it should not interpolate longer gaps than 5 minutes. I understand that interpolate
only extrapolates forward in time, but I want it to also extrapolate backward in time. Is there any known methods I can use for my problem, without reinventing the wheel?
方法data.interpolate
接受输入参数limit
,该参数定义要由插值替换的连续NaN的最大数量.但这仍然可以插值到极限,但是在这种情况下,我想继续使用所有NaN.
The method data.interpolate
accepts the input parameter limit
, which defines the maximum number of consecutive NaNs to be substituted by interpolation. But this still interpolates up to the limit, but I want to go on with all NaNs in that case.
推荐答案
所以这是一个应该解决该问题的掩码.只需interpolate
,然后应用该掩码即可将适当的值重置为NaN.老实说,这比我意识到的要多得多,因为我不得不循环遍历每一列,但是如果没有我提供一些"ones"之类的虚拟列,groupby就无法正常工作.
So here is a mask that ought to solve the problem. Just interpolate
and then apply the mask to reset appropriate values to NaN. Honestly, this was a bit more work than I realized it would be because I had to loop through each column but then groupby didn't quite work without me providing some dummy columns like 'ones'.
无论如何,我可以解释是否有任何不清楚的地方,但实际上只有几行很难理解.请参阅此处,以获取更多有关df['new']
行上的提示或只是打印出单独的行以更好地了解发生了什么.
Anyway, I can explain if anything is unclear but really only a couple of the lines are somewhat hard to understand. See here for a little bit more of an explanation of the trick on the df['new']
line or just print out individual lines to better see what is going on.
mask = data.copy()
for i in list('abcdefgh'):
df = pd.DataFrame( data[i] )
df['new'] = ((df.notnull() != df.shift().notnull()).cumsum())
df['ones'] = 1
mask[i] = (df.groupby('new')['ones'].transform('count') < 5) | data[i].notnull()
In [7]: data
Out[7]:
a b c d e f g h
2014-02-21 14:50:00 123.5 433.5 123.5 NaN NaN NaN 2330.3 2330.3
2014-02-21 14:51:00 NaN 523.2 132.3 NaN NaN NaN NaN NaN
2014-02-21 14:52:00 136.3 536.3 136.3 NaN NaN NaN NaN NaN
2014-02-21 14:53:00 164.3 464.3 164.3 NaN NaN NaN NaN NaN
2014-02-21 14:54:00 213.0 413.0 NaN NaN NaN 2763.0 NaN NaN
2014-02-21 14:55:00 164.3 164.3 NaN NaN NaN 2142.3 NaN NaN
2014-02-21 14:56:00 213.0 213.0 NaN NaN NaN 2127.3 NaN NaN
2014-02-21 14:57:00 221.1 221.1 NaN NaN 2330.3 2330.3 NaN 2777.7
In [8]: mask
Out[8]:
a b c d e f g h
2014-02-21 14:50:00 True True True False False True True True
2014-02-21 14:51:00 True True True False False True False False
2014-02-21 14:52:00 True True True False False True False False
2014-02-21 14:53:00 True True True False False True False False
2014-02-21 14:54:00 True True True False False True False False
2014-02-21 14:55:00 True True True False False True False False
2014-02-21 14:56:00 True True True False False True False False
2014-02-21 14:57:00 True True True False True True False True
如果您对推断没有任何幻想,那么从那里开始很容易:
It's easy from there if you don't do anything fancier with respect to extrapolation:
In [9]: data.interpolate().bfill()[mask]
Out[9]:
a b c d e f g h
2014-02-21 14:50:00 123.5 433.5 123.5 NaN NaN 2763.0 2330.3 2330.3
2014-02-21 14:51:00 129.9 523.2 132.3 NaN NaN 2763.0 NaN NaN
2014-02-21 14:52:00 136.3 536.3 136.3 NaN NaN 2763.0 NaN NaN
2014-02-21 14:53:00 164.3 464.3 164.3 NaN NaN 2763.0 NaN NaN
2014-02-21 14:54:00 213.0 413.0 164.3 NaN NaN 2763.0 NaN NaN
2014-02-21 14:55:00 164.3 164.3 164.3 NaN NaN 2142.3 NaN NaN
2014-02-21 14:56:00 213.0 213.0 164.3 NaN NaN 2127.3 NaN NaN
2014-02-21 14:57:00 221.1 221.1 164.3 NaN 2330.3 2330.3 NaN 2777.7
编辑以添加:通过将一些内容移出循环,这是一种更快的方法(此示例数据的大约2倍),并且稍微简单一些:
Edit to add: Here's a faster (about 2x on this sample data) and slightly simpler way, by moving some stuff outside of the loop:
mask = data.copy()
grp = ((mask.notnull() != mask.shift().notnull()).cumsum())
grp['ones'] = 1
for i in list('abcdefgh'):
mask[i] = (grp.groupby(i)['ones'].transform('count') < 5) | data[i].notnull()
这篇关于内插(或外推) pandas 数据框中的小缝隙的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!