Restart cumsum and get index if cumsum exceeds a value


Question

Say I have an array of distances x=[1,2,1,3,3,2,1,5,1,1].

I want to get the indices of x at which the cumsum reaches 10; in this case, idx=[4,9].

So the cumsum restarts each time the condition is met.

I can do it with a loop, but loops are slow for large arrays, and I was wondering whether it could be done in a vectorized way.
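For reference, the plain loop mentioned in the question could look like the sketch below. The function name `cumsum_breach_loop` is hypothetical and not taken from the original post; it is only meant to pin down the expected behaviour on the example input.

```python
def cumsum_breach_loop(x, target):
    """Yield each index where the running total reaches `target`, then restart."""
    total = 0
    for i, x_i in enumerate(x):
        total += x_i
        if total >= target:
            yield i      # the running total reached the target at index i
            total = 0    # restart the running total


x = [1, 2, 1, 3, 3, 2, 1, 5, 1, 1]
idx = list(cumsum_breach_loop(x, 10))
print(idx)  # [4, 9]
```

Each restart depends on where the previous one happened, which is why a single `np.cumsum` call is not enough on its own.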

Recommended answer

Here's one that uses numba and writes the indices into a preallocated array -

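import numpy as np
from numba import njit

@njit
def cumsum_breach_numba2(x, target, result):
    # Fill `result` in place and return how many indices were written.
    total = 0
    iterID = 0
    for i,x_i in enumerate(x):
        total += x_i
        if total >= target:
            result[iterID] = i
            iterID += 1
            total = 0  # restart the running sum
    return iterID

def cumsum_breach_array_init(x, target):
    x = np.asarray(x)
    # At most len(x) indices can be produced, so allocate that many up front.
    result = np.empty(len(x),dtype=np.uint64)
    idx = cumsum_breach_numba2(x, target, result)
    # Only the first `idx` entries are valid; this slice is a view into `result`.
    return result[:idx]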

Timings

Including @piRSquared's solutions and using the benchmarking setup from the same post -

In [58]: np.random.seed([3, 1415])
    ...: x = np.random.randint(100, size=1000000).tolist()

# @piRSquared soln1
In [59]: %timeit list(cumsum_breach(x, 10))
10 loops, best of 3: 73.2 ms per loop

# @piRSquared soln2
In [60]: %timeit cumsum_breach_numba(np.asarray(x), 10)
10 loops, best of 3: 69.2 ms per loop

# From this post
In [61]: %timeit cumsum_breach_array_init(x, 10)
10 loops, best of 3: 39.1 ms per loop

Numba: appending vs. array initialization

For a closer look at how array initialization helps, which seems to be the big difference between the two numba implementations, let's time both on array data. Converting the list to an array was itself a large part of the runtime above, and both functions depend on that conversion -

In [62]: x = np.array(x)

In [63]: %timeit cumsum_breach_numba(x, 10)  # with appending
10 loops, best of 3: 31.5 ms per loop

In [64]: %timeit cumsum_breach_array_init(x, 10)
1000 loops, best of 3: 1.8 ms per loop

To force the output to have its own memory space, we can make a copy. It won't change things in a big way, though -

In [65]: %timeit cumsum_breach_array_init(x, 10).copy()
100 loops, best of 3: 2.67 ms per loop
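Why a copy might be wanted can be seen with plain NumPy. The snippet below is only an illustrative sketch with made-up buffer contents, not part of the original answer: a slice such as `result[:idx]` is a view that keeps the whole preallocated buffer alive, while `.copy()` yields a small array that owns its data.

```python
import numpy as np

# Stand-in for the preallocated result buffer (size chosen for illustration).
buf = np.empty(1_000_000, dtype=np.uint64)
buf[:2] = [4, 9]

view = buf[:2]        # what a slice of the buffer returns: a view
owned = view.copy()   # independent array holding just the two indices

print(view.base is buf)    # True: the view still references the full buffer
print(owned.base is None)  # True: the copy owns its memory
print(buf.nbytes, owned.nbytes)  # 8000000 16
```

So the copy costs a little time but lets the large buffer be freed once nothing else refers to it.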
