Extend numpy mask by n cells to the right for each bad value, efficiently


Problem description

Let's say I have a length-30 array with 4 bad values in it. I want to create a mask for those bad values, but since I will be using rolling window functions, I'd also like a fixed number of subsequent indices after each bad value to be marked as bad. In the below, n = 3:

I would like to do this as efficiently as possible, because this routine will be run many times on large data series containing billions of data points. I therefore need something as close to a vectorized numpy solution as possible, and I'd like to avoid Python loops.

To avoid retyping, here is the array:

import numpy as np
a = np.array([4, 0, 8, 5, 10, 9, np.nan, 1, 4, 9, 9, np.nan, np.nan, 9,\
              9, 8, 0, 3, 7, 9, 2, 6, 7, 2, 9, 4, 1, 1, np.nan, 10])
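To pin down the target output, here is a deliberately naive sketch that loops over the bad indices. It is not part of the original question; the name `extend_mask_naive` is hypothetical, and the function is meant only as a slow reference to check faster versions against.

```python
import numpy as np

def extend_mask_naive(a, n=3):
    """Hypothetical reference helper: mark each NaN and the n cells to its right."""
    bad = np.isnan(a)
    out = bad.copy()
    # Python loop over the bad positions: fine for checking, too slow for billions of points
    for i in np.flatnonzero(bad):
        out[i:i + n + 1] = True  # slicing clips automatically at the end of the array
    return out

a = np.array([4, 0, 8, 5, 10, 9, np.nan, 1, 4, 9, 9, np.nan, np.nan, 9,
              9, 8, 0, 3, 7, 9, 2, 6, 7, 2, 9, 4, 1, 1, np.nan, 10])

mask = extend_mask_naive(a, n=3)
print(np.flatnonzero(mask))  # [ 6  7  8  9 11 12 13 14 15 28 29]
```

The NaNs sit at indices 6, 11, 12 and 28, so with n = 3 the masked runs are 6 to 9, 11 to 15, and 28 to 29 (the last run is cut off by the end of the array).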

Recommended answer

Yet another answer!
It just takes the mask you already have and applies a logical OR to shifted versions of itself. Nicely vectorized and insanely fast! :D

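def repeat_or(a, n=4):
    # n is the total length of each masked run: the bad value itself
    # plus the n - 1 cells to its right (n=4 matches the question's n = 3)
    m = np.isnan(a)
    k = m.copy()

    # nothing to extend for n <= 1; the final shift below would be 0,
    # and m[:-0] is an empty slice that cannot be OR-ed into k
    if n <= 1:
        return k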

    # lenM and lenK say for each mask how many
    # subsequent Trues there are at least
    lenM, lenK = 1, 1

    # we run until a combination of both masks will give us n or more
    # subsequent Trues
    while lenM+lenK < n:
        # append what we have in k to the end of what we have in m
        m[lenM:] |= k[:-lenM]

        # swap so that m is again the small one
        m, k = k, m

        # update the lengths
        lenM, lenK = lenK, lenM+lenK

    # see how much m has to be shifted in order to append the missing Trues
    k[n-lenM:] |= m[:-n+lenM]

    return k
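To see how few passes the loop needs, the length bookkeeping can be traced without touching any array. This is only a sketch: `shift_schedule` is a hypothetical helper that mirrors the `lenM`/`lenK` updates of `repeat_or` and records the shift used by each OR, assuming n >= 2.

```python
def shift_schedule(n):
    """Hypothetical tracing helper: shifts that repeat_or applies for a given n (n >= 2)."""
    lenM, lenK = 1, 1
    shifts = []
    while lenM + lenK < n:
        shifts.append(lenM)               # shift used by m[lenM:] |= k[:-lenM]
        lenM, lenK = lenK, lenM + lenK    # same update as in repeat_or
    shifts.append(n - lenM)               # final shift: k[n-lenM:] |= m[:-n+lenM]
    return shifts

print(shift_schedule(4))         # [1, 1, 2]
print(shift_schedule(100))       # [1, 1, 2, 3, 5, 8, 13, 21, 34, 45]
print(len(shift_schedule(100)))  # 10
```

For n = 100 that is 10 OR passes over the array, where shifting by one cell at a time would take 99.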

Unfortunately I couldn't get m[i:] |= m[:-i] running; it is probably a bad idea to modify the mask while also reading from it. m[:-i] |= m[i:] does work, but that extends the mask in the wrong direction.
Anyway, instead of doubling the run length on every pass we now have Fibonacci-like growth, which is still far better than linear: the number of passes grows only logarithmically with n.
(I never thought I'd actually write an algorithm that is really related to the Fibonacci sequence without it being some weird math problem.)
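For comparison, the doubling idea can be sketched with an explicit temporary, which sidesteps any question of reading from memory that is being written. This is not the answer's method: `doubling_or` is a hypothetical variant, and the extra `.copy()` costs one additional mask-sized allocation per pass, which may matter for very large arrays.

```python
import numpy as np

def doubling_or(a, n=4):
    """Hypothetical variant: total run length n, shift doubles on each pass."""
    m = np.isnan(a)
    length = 1  # every original True currently starts a run of at least `length`
    while length < n:
        # never shift further than the current run length (no gaps)
        # and never past the target length n
        shift = min(length, n - length)
        # copy the right-hand side so the OR reads from an untouched snapshot
        m[shift:] |= m[:-shift].copy()
        length += shift
    return m

x = np.array([1.0, np.nan, 2.0, 3.0, 4.0, 5.0, np.nan, 6.0])
print(np.flatnonzero(doubling_or(x, n=3)))  # [1 2 3 6 7]
```

As in `repeat_or`, n here is the total run length, so n=3 masks each NaN plus the two cells to its right.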

Testing under "real" conditions with an array of size 1e6 containing 1e5 NaNs:

In [5]: a = np.random.random(size=10**6)

In [6]: a[np.random.choice(np.arange(len(a), dtype=int), 10**5, replace=False)] = np.nan

In [7]: %timeit reduceat(a)
10 loops, best of 3: 65.2 ms per loop

In [8]: %timeit index_expansion(a)
100 loops, best of 3: 12 ms per loop

In [9]: %timeit cumsum_trick(a)
10 loops, best of 3: 17 ms per loop

In [10]: %timeit repeat_or(a)
1000 loops, best of 3: 1.9 ms per loop

In [11]: %timeit agml_indexing(a)
100 loops, best of 3: 6.91 ms per loop

I'll leave further benchmarks to Thomas.
