Python:滑动窗口均值,忽略丢失的数据 [英] Python: Sliding windowed mean, ignoring missing data

查看:166
本文介绍了Python:滑动窗口均值,忽略丢失的数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我当前正在尝试处理实验时间序列数据集,该数据集缺少值.我想在处理nan值的同时计算该数据集随时间的滑动窗口均值.对我而言,正确的方法是在每个窗口内计算有限元素的总和,然后将其除以它们的数量.这种非线性迫使我使用非卷积方法来面对这个问题,因此在该过程的这一部分中我遇到了严重的时间瓶颈.作为我要完成的工作的代码示例,我提出以下内容:

I am currently trying to process an experimental timeseries dataset, which has missing values. I would like to calculate the sliding windowed mean of this dataset along time, while handling nan values. The correct way for me to do it is to compute inside each window the sum of the finite elements and divide it with their number. This nonlinearity forces me to use non convolutional methods to face this problem, thus I have a severe time bottleneck in this part of the process. As a code example of what I am trying to accomplish I present the following:

import numpy as np
#Construct sample data
n = 50
n_miss = 20
win_size = 3
data= np.random.random(50)
data[np.random.randint(0,n-1, n_miss)] = None

#Compute mean
result = np.zeros(data.size)
for count in range(data.size):
    part_data = data[max(count - (win_size - 1) / 2, 0): min(count + (win_size + 1) / 2, data.size)]
    mask = np.isfinite(part_data)
    if np.sum(mask) != 0:
        result[count] = np.sum(part_data[mask]) / np.sum(mask)
    else:
        result[count] = None
print 'Input:\t',data
print 'Output:\t',result

输出:

Input:  [ 0.47431791  0.17620835  0.78495647  0.79894688  0.58334064  0.38068788
  0.87829696         nan  0.71589171         nan  0.70359557  0.76113969
  0.13694387  0.32126573  0.22730891         nan  0.35057169         nan
  0.89251851  0.56226354  0.040117           nan  0.37249799  0.77625334
         nan         nan         nan         nan  0.63227417  0.92781944
  0.99416471  0.81850753  0.35004997         nan  0.80743783  0.60828597
         nan  0.01410721         nan         nan  0.6976317          nan
  0.03875394  0.60924066  0.22998065         nan  0.34476729  0.38090961
         nan  0.2021964 ]
Output: [ 0.32526313  0.47849424  0.5867039   0.72241466  0.58765847  0.61410849
  0.62949242  0.79709433  0.71589171  0.70974364  0.73236763  0.53389305
  0.40644977  0.22850617  0.27428732  0.2889403   0.35057169  0.6215451
  0.72739103  0.49829968  0.30119027  0.20630749  0.57437567  0.57437567
  0.77625334         nan         nan  0.63227417  0.7800468   0.85141944
  0.91349722  0.7209074   0.58427875  0.5787439   0.7078619   0.7078619
  0.31119659  0.01410721  0.01410721  0.6976317   0.6976317   0.36819282
  0.3239973   0.29265842  0.41961066  0.28737397  0.36283845  0.36283845
  0.29155301  0.2021964 ]

可以通过numpy操作产生此结果,而无需使用for循环吗?

Can this result be produced by numpy operations, without using a for loop?

推荐答案

这是使用请注意,这将在两侧各增加一个元素.

Please note that this would have one extra element on either sides.

如果您使用的是2D数据,则可以使用

If you are working with 2D data, we can use Scipy's 2D convolution.

方法-

def original_app(data, win_size):
    #Compute mean
    result = np.zeros(data.size)
    for count in range(data.size):
        part_data = data[max(count - (win_size - 1) / 2, 0): \
                 min(count + (win_size + 1) / 2, data.size)]
        mask = np.isfinite(part_data)
        if np.sum(mask) != 0:
            result[count] = np.sum(part_data[mask]) / np.sum(mask)
        else:
            result[count] = None
    return result

def numpy_app(data, win_size):     
    mask = np.isnan(data)
    K = np.ones(win_size,dtype=int)
    out = np.convolve(np.where(mask,0,data), K)/np.convolve(~mask,K)
    return out[1:-1]  # Slice out the one-extra elems on sides

样品运行-

In [118]: #Construct sample data
     ...: n = 50
     ...: n_miss = 20
     ...: win_size = 3
     ...: data= np.random.random(50)
     ...: data[np.random.randint(0,n-1, n_miss)] = np.nan
     ...: 

In [119]: original_app(data, win_size = 3)
Out[119]: 
array([ 0.88356487,  0.86829731,  0.85249541,  0.83776219,         nan,
               nan,  0.61054015,  0.63111926,  0.63111926,  0.65169837,
        0.1857301 ,  0.58335324,  0.42088104,  0.5384565 ,  0.31027752,
        0.40768907,  0.3478563 ,  0.34089655,  0.55462903,  0.71784816,
        0.93195716,         nan,  0.41635575,  0.52211653,  0.65053379,
        0.76762282,  0.72888574,  0.35250449,  0.35250449,  0.14500637,
        0.06997668,  0.22582318,  0.18621848,  0.36320784,  0.19926647,
        0.24506199,  0.09983572,  0.47595439,  0.79792941,  0.5982114 ,
        0.42389375,  0.28944089,  0.36246113,  0.48088139,  0.71105449,
        0.60234163,  0.40012839,  0.45100475,  0.41768466,  0.41768466])

In [120]: numpy_app(data, win_size = 3)
__main__:36: RuntimeWarning: invalid value encountered in divide
Out[120]: 
array([ 0.88356487,  0.86829731,  0.85249541,  0.83776219,         nan,
               nan,  0.61054015,  0.63111926,  0.63111926,  0.65169837,
        0.1857301 ,  0.58335324,  0.42088104,  0.5384565 ,  0.31027752,
        0.40768907,  0.3478563 ,  0.34089655,  0.55462903,  0.71784816,
        0.93195716,         nan,  0.41635575,  0.52211653,  0.65053379,
        0.76762282,  0.72888574,  0.35250449,  0.35250449,  0.14500637,
        0.06997668,  0.22582318,  0.18621848,  0.36320784,  0.19926647,
        0.24506199,  0.09983572,  0.47595439,  0.79792941,  0.5982114 ,
        0.42389375,  0.28944089,  0.36246113,  0.48088139,  0.71105449,
        0.60234163,  0.40012839,  0.45100475,  0.41768466,  0.41768466])

运行时测试-

In [122]: #Construct sample data
     ...: n = 50000
     ...: n_miss = 20000
     ...: win_size = 3
     ...: data= np.random.random(n)
     ...: data[np.random.randint(0,n-1, n_miss)] = np.nan
     ...: 

In [123]: %timeit original_app(data, win_size = 3)
1 loops, best of 3: 1.51 s per loop

In [124]: %timeit numpy_app(data, win_size = 3)
1000 loops, best of 3: 1.09 ms per loop

In [125]: import pandas as pd

# @jdehesa's pandas solution
In [126]: %timeit pd.Series(data).rolling(window=3, min_periods=1).mean()
100 loops, best of 3: 3.34 ms per loop

这篇关于Python:滑动窗口均值,忽略丢失的数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆