Fill data gaps with the average of data from adjacent days


Question

Imagine a data frame with multiple variables measured every 30 minutes. Every time series inside this data frame has gaps, possibly at different positions. These gaps are to be replaced by some kind of running mean, let's say over +/- 2 days. For example, if data is missing on day 4 at 07:30, I want to replace the NaN entry with the average of the measurements at 07:30 on days 2, 3, 5 and 6. Note that day 5, 07:30 may also be NaN; in that case it should be excluded from the average that replaces the missing measurement on day 4 (this should be possible with np.nanmean?).

I am not sure how to do this. Right now, I would probably loop over every single row and column in the data frame and write a really bad hack along the lines of np.mean(df["A"].iloc[[i-48, i, i+48]]), but I feel there must be a more pythonic/pandas-y way?

Sample data set:

import numpy as np
import pandas as pd

# generate a 1-week time series
dates = pd.date_range(start="2014-01-01 00:00", end="2014-01-07 00:00", freq="30min")
df = pd.DataFrame(np.random.randn(len(dates),3), index=dates, columns=("A", "B", "C"))

# generate some artificial gaps (label-based slicing with .loc; .ix has been removed from pandas)
df.loc["2014-01-04 10:00":"2014-01-04 11:00", "A"] = np.nan
df.loc["2014-01-04 12:30":"2014-01-04 14:00", "B"] = np.nan
df.loc["2014-01-04 09:30":"2014-01-04 15:00", "C"] = np.nan

print(df.loc["2014-01-04 08:00":"2014-01-04 16:00"])

                            A         B         C
2014-01-04 08:00:00  0.675720  2.186484 -0.033969
2014-01-04 08:30:00 -0.897217  1.332437 -2.618197
2014-01-04 09:00:00  0.299395  0.837023  1.346117
2014-01-04 09:30:00  0.223051  0.913047       NaN
2014-01-04 10:00:00       NaN  1.395480       NaN
2014-01-04 10:30:00       NaN -0.800921       NaN
2014-01-04 11:00:00       NaN -0.932760       NaN
2014-01-04 11:30:00  0.057219 -0.071280       NaN
2014-01-04 12:00:00  0.215810 -1.099531       NaN
2014-01-04 12:30:00 -0.532563       NaN       NaN
2014-01-04 13:00:00 -0.697872       NaN       NaN
2014-01-04 13:30:00 -0.028541       NaN       NaN
2014-01-04 14:00:00 -0.073426       NaN       NaN
2014-01-04 14:30:00 -1.187419  0.221636       NaN
2014-01-04 15:00:00  1.802449  0.144715       NaN
2014-01-04 15:30:00  0.446615  1.013915 -1.813272
2014-01-04 16:00:00 -0.410670  1.265309 -0.198607

[17 rows x 3 columns]

(An even more sophisticated tool would also exclude from the averaging procedure any measurements that were themselves created by averaging, but that doesn't have to be part of an answer, since I believe it may make things too complicated for now.)

/edit: A sample solution that I'm not really happy with:

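# specify the columns of df where gaps should be filled
cols = ["A", "B", "C"]
for col in cols:
    col_pos = df.columns.get_loc(col)
    for i in range(len(df)):
        if np.isnan(df.iloc[i, col_pos]):
            # replace with the mean of the rows 48 positions (one day) before and after;
            # assumes both positions exist, i.e. the gap is not in the first or last day
            df.iloc[i, col_pos] = np.nanmean(df.iloc[[i - 48, i + 48], col_pos])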

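There are two things I don't like about this solution:

  1. It fails if a row is missing or duplicated anywhere. In the last line, I would like to always subtract "one day", no matter whether that is 47, 48 or 49 rows away. It would also be nice to extend the range (e.g. -3 days to +3 days) without manually writing out a list for the index.
  2. If possible, I would like to get rid of the loops.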

Recommended answer

This should be a faster and more concise way to do it. The main thing is to use the shift() function instead of the loop. A simple version would be this:

df[df.isnull()] = np.nanmean([df.shift(-48).values, df.shift(48).values], axis=0)

It turned out to be really hard to generalize this, but this seems to work:

window = 1  # days on each side; shift(0) is NaN at the gaps, so nanmean ignores it there
df[df.isnull()] = np.nanmean([df.shift(x).values for x in
                              range(-48 * window, 48 * (window + 1), 48)], axis=0)

I'm not sure, but I suspect there might be a bug with nanmean, which may also be the reason you got missing values yourself. It seems to me that nanmean cannot handle NaNs if you feed it a data frame. But if I convert to an array (with .values) and use axis=0, then it seems to work.
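One possible contributor to the behaviour described above, offered as a guess rather than a diagnosis: without an axis argument, np.nanmean pools every element of its input into a single scalar, whereas axis=0 averages the stacked inputs cell by cell. The sketch below uses two small made-up arrays (`before` and `after` are illustrative names standing in for the shifted frames) to show the difference.

```python
import numpy as np

# Two small arrays standing in for df.shift(-48).values and df.shift(48).values
before = np.array([[1.0, np.nan], [3.0, 4.0]])
after = np.array([[3.0, 2.0], [np.nan, 8.0]])

# No axis: all non-NaN elements are pooled into one scalar
pooled = np.nanmean([before, after])

# axis=0: one mean per cell, ignoring a NaN in either input
per_cell = np.nanmean([before, after], axis=0)

print(pooled)
print(per_cell)
```

Only the axis=0 result has the same shape as the inputs, which is what a cell-by-cell gap fill needs.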

Check results for window=1:

print(df.loc["2014-01-04 12:30":"2014-01-04 14:00", "B"])
print(df.loc["2014-01-03 12:30":"2014-01-03 14:00", "B"])
print(df.loc["2014-01-05 12:30":"2014-01-05 14:00", "B"])

2014-01-04 12:30:00    0.940193     # was nan, now filled
2014-01-04 13:00:00    0.078160
2014-01-04 13:30:00   -0.662918
2014-01-04 14:00:00   -0.967121

2014-01-03 12:30:00    0.947915     # day before
2014-01-03 13:00:00    0.167218
2014-01-03 13:30:00   -0.391444
2014-01-03 14:00:00   -1.157040

2014-01-05 12:30:00    0.932471     # day after
2014-01-05 13:00:00   -0.010899
2014-01-05 13:30:00   -0.934391
2014-01-05 14:00:00   -0.777203
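The shift-by-48-rows idea can also be expressed in terms of timestamps instead of row positions, so that "one day" stays one day even when a row is missing. The following is a minimal sketch, not the answerer's code: `fill_gaps_from_adjacent_days` and its `window` parameter are hypothetical names, and it assumes a unique DatetimeIndex. It relies on DataFrame.shift with a `freq` argument, which moves the index rather than the values.

```python
import numpy as np
import pandas as pd


def fill_gaps_from_adjacent_days(df, window=2):
    """Fill NaNs with the mean of the same clock time within +/- `window` days.

    Hypothetical helper; assumes a unique DatetimeIndex.
    """
    neighbours = []
    for k in range(-window, window + 1):
        if k == 0:
            continue  # the day being filled contributes nothing
        # Move the timestamps by k whole days, then keep only the rows that
        # land on a timestamp present in the original index.
        shifted = df.shift(periods=k, freq="D").reindex(df.index)
        neighbours.append(shifted.to_numpy(dtype=float))
    # axis=0 averages across the shifted copies, cell by cell, ignoring NaNs
    fill = pd.DataFrame(np.nanmean(np.stack(neighbours), axis=0),
                        index=df.index, columns=df.columns)
    return df.fillna(fill)


# Tiny demo: each value equals the day of the month, so the means are easy to check
idx = pd.date_range("2014-01-01", periods=5 * 48, freq="30min")
demo = pd.DataFrame({"A": np.asarray(idx.day, dtype=float)}, index=idx)
demo = demo.drop(pd.Timestamp("2014-01-02 03:00"))  # simulate a missing row
demo.loc[pd.Timestamp("2014-01-03 07:30"), "A"] = np.nan
demo.loc[pd.Timestamp("2014-01-04 07:30"), "A"] = np.nan
filled = fill_gaps_from_adjacent_days(demo, window=2)
```

Because every mean is computed from the unfilled frame, values created by averaging never feed into other fills. A cell stays NaN when all of its neighbours are missing.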

Regarding the problem of missing rows, it will depend on your data, but if you precede the above with

df = df.resample('30min').mean()  # recent pandas needs an explicit aggregation; mean was the old default

that will give you a row of NaNs for every missing row, and you can then fill them in the same way as all the other NaNs. That's probably the simplest and fastest way, if it works.
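To illustrate the resampling step, here is a small sketch with made-up data: one record is dropped and another is written twice (the frame and the name `irregular` are assumptions for the demo). Resampling to a 30-minute grid with `.mean()` restores a regular index.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2014-01-01 00:00", periods=6, freq="30min")
df = pd.DataFrame({"A": np.arange(6.0)}, index=idx)

# Simulate a logger that skipped one record and wrote another one twice
irregular = pd.concat([df.drop(idx[2]), df.iloc[[4]]]).sort_index()

# .mean() collapses duplicate timestamps; empty 30-minute bins become NaN
regular = irregular.resample("30min").mean()
print(regular)
```

After this step every day has the same number of rows, so a positional shift of 48 rows corresponds to exactly one day again.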

Alternatively, you could do something with groupby. My groupby-fu is weak, but to give you the flavor of it, something like:

# group by time of day (index.hour alone would lump :00 and :30 together), then forward-fill
df.groupby(df.index.time).ffill()

would correctly deal with the issue of missing rows, but not the other things.
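The groupby idea can be pushed a little further. The sketch below is one possible variant, not the answerer's code: it groups by time of day and takes a centred rolling mean inside each group, which spans +/- `window` days as long as every day is present (for example after resampling to a regular grid). The data and the `window` name are assumptions for the demo.

```python
import numpy as np
import pandas as pd

window = 2  # days on each side (assumed parameter name)

# Demo frame: each value equals the day of the month
idx = pd.date_range("2014-01-01", periods=7 * 48, freq="30min")
df = pd.DataFrame({"A": np.asarray(idx.day, dtype=float)}, index=idx)
t4 = pd.Timestamp("2014-01-04 07:30")
t5 = pd.Timestamp("2014-01-05 07:30")
df.loc[t4, "A"] = np.nan
df.loc[t5, "A"] = np.nan

# Within each time-of-day group, consecutive rows are consecutive days, so a
# centred window of 2*window+1 rows covers +/- `window` days. The rolling mean
# skips NaNs, so the gap itself and any missing neighbours are left out.
daily_mean = df.groupby(df.index.time).transform(
    lambda g: g.rolling(2 * window + 1, center=True, min_periods=1).mean()
)

# Only the NaN cells take the rolling value; existing measurements stay untouched
filled = df.fillna(daily_mean)
print(filled.loc[[t4, t5]])
```

Unlike the timestamp-based shift, this counts rows within each group, so it silently uses the wrong days if a day is missing from the index.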
