在python中向量化6 for循环累积总和 [英] Vectorize a 6 for loop cumulative sum in python

查看:66
本文介绍了在python中向量化6 for循环累积总和的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

数学问题是:

求和中的表达式实际上比上面的表达式复杂得多,但这是一个最小的工作示例,不会使事情复杂化.我已经使用6个嵌套的for循环在Python中编写了此代码,并且正如预期的那样,即使在Numba,Cython和朋友的帮助下,它的性能也很差(真正的形式效果很差,需要评估数百万次).在这里,它是使用嵌套的for循环和累加的总和来写的:

The expression within the sums is actually much more complex than the one above, but this is for a minimal working example to not over-complicate things. I have written this in Python using 6 nested for loops and as expected it performs very badly (the true form performs badly and needs evaluating millions of times), even with help from Numba, Cython and friends. Here it is written using nested for loops and a cumulative sum:

import numpy as np

def func1(a,b,c,d):
    '''
    Minimal working example of multiple summation
    '''
    B = 0
    for ai in range(0,a):
        for bi in range(0,b):
            for ci in range(0,c):
                for di in range(0,d):
                    for ei in range(0,ai+bi):
                        for fi in range(0,ci+di):
                            B += (2)**(ei-fi-ai-ci-di+1)*(ei**2-2*(ei*fi)-7*di)*np.math.factorial(ei)


    return a, b, c, d, B

表达式受4个数字控制,对于 func1(4,6,3,4) B 的输出为21769947.844726562.

The expression is controlled with 4 numbers as input and for func1(4,6,3,4) the output for B is 21769947.844726562.

我到处寻找帮助,并找到了一些Stack帖子,其中包括一些示例:

I have looked around for assistance with this and have found several Stack posts, with some examples being:

NumPy中的外部产品:矢量化六个嵌套循环

在Python/具有不同阵列形状的小块

Python矢量化嵌套嵌套循环

我尝试使用从这些有用的帖子中学到的知识,但是经过多次尝试,我一直得出错误的答案.即使对其中一个和进行矢量化处理,也将为真正的问题带来巨大的性能提升,但是和范围不同的事实似乎使我无法接受.有人对如何进行此操作有任何提示吗?

I have tried to use what I have learned from these useful posts, but after many attempts, I keep arriving at the wrong answer. Even vectorizing one of the inner sums will reap a huge performance gain for the real problem, but the fact that the sum ranges are different seems to be throwing me off. Does anyone have any tips on how to progress with this?

推荐答案

最终版本(我认为),它更简洁,更快速地结合了 max9111的回答.

Final (I think) version, a bit cleaner and faster incorporating ideas from max9111's answer.

import numpy as np
from numba import as nb

@nb.njit()
def func1_jit(a, b, c, d):
    # Precompute
    exp_min = 5 - (a + b + c + d)
    exp_max = b
    exp = 2. ** np.arange(exp_min, exp_max + 1)
    fact_e = np.empty((a + b - 2))
    fact_e[0] = 1
    for ei in range(1, len(fact_e)):
        fact_e[ei] = ei * fact_e[ei - 1]
    # Loops
    B = 0
    for ai in range(0, a):
        for bi in range(0, b):
            for ci in range(0, c):
                for di in range(0, d):
                    for ei in range(0, ai + bi):
                        for fi in range(0, ci + di):
                            B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
    return B

这已经比以前的任何选项都快得多,但是我们仍然没有利用多个CPU.一种实现方法是在函数本身内,例如并行化外循环.这会在每次创建线程的调用中增加一些开销,因此对于较小的输入实际上会稍慢一些,但对于较大的值应该会明显更快:

This is already quite faster than any of the previous options, but we are still not taking advantage of multiple CPUs. One way to do it is within the function itself, for example parallelizing the outer loop. This adds some overhead on each call to create the threads, so for small inputs is actually a bit slower, but should be significantly faster for bigger values:

import numpy as np
from numba import as nb

@nb.njit(parallel=True)
def func1_par(a, b, c, d):
    # Precompute
    exp_min = 5 - (a + b + c + d)
    exp_max = b
    exp = 2. ** np.arange(exp_min, exp_max + 1)
    fact_e = np.empty((a + b - 2))
    fact_e[0] = 1
    for ei in range(1, len(fact_e)):
        fact_e[ei] = ei * fact_e[ei - 1]
    # Loops
    B = np.empty((a,))
    for ai in nb.prange(0, a):
        Bi = 0
        for bi in range(0, b):
            for ci in range(0, c):
                for di in range(0, d):
                    for ei in range(0, ai + bi):
                        for fi in range(0, ci + di):
                            Bi += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
        B[ai] = Bi
    return np.sum(B)

或者,如果您有很多要评估函数的地方,也可以在该级别进行并行化.在这里, a_arr b_arr c_arr d_arr 是要对其进行求值的值的向量:从numba的

Or, if you have many points where you want to evaluate the function, you can parallelize at that level too. Here a_arr, b_arr, c_arr and d_arr are vectors of values where the function is to be evaluated:

from numba import as nb

@nb.njit(parallel=True)
def func1_arr(a_arr, b_arr, c_arr, d_arr):
    B_arr = np.empty((len(a_arr),))
    for i in nb.prange(len(B_arr)):
        B_arr[i] = func1_jit(a_arr[i], b_arr[i], c_arr[i], d_arr[i])
    return B_arr

最佳配置取决于您的输入,使用模式,硬件等,因此您可以结合不同的想法以适合您的情况.

The best configuration depends on your inputs, usage pattern, hardware, etc. so you can combine the different ideas to suit your case.

实际上,忘了我之前说过的话.最好的办法是JIT编译算法,但是以一种更有效的方式.首先计算昂贵的零件(我采用了指数和阶乘),然后将其传递给已编译的loopy函数:

Actually, forget what I said before. The best thing is to JIT-compile the algorithm, but in a more effective manner. Compute first the expensive parts (I took the exponential and the factorial) and then pass it to the compiled loopy function:

import numpy as np
from numba import njit

def func1(a, b, c, d):
    exp_min = 5 - (a + b + c + d)
    exp_max = b
    exp = 2. ** np.arange(exp_min, exp_max + 1)
    ee = np.arange(a + b - 2)
    fact_e = scipy.special.factorial(ee)
    return func1_inner(a, b, c, d, exp_min, exp, fact_e)

@njit()
def func1_inner(a, b, c, d, exp_min, exp, fact_e):
    B = 0
    for ai in range(0, a):
        for bi in range(0, b):
            for ci in range(0, c):
                for di in range(0, d):
                    for ei in range(0, ai + bi):
                        for fi in range(0, ci + di):
                            B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
    return B

在我的实验中,这是迄今为止最快的选择,并且占用很少的额外内存(仅是预先计算的值,输入大小线性).

This is, in my experiments, the fastest option by far, and takes little extra memory (only the precomputed values, with size linear on the input).

a, b, c, d = 4, 6, 3, 4
# The original function
%timeit func1_orig(a, b, c, d)
# 2.07 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# The grid-evaluated function
%timeit func1_grid(a, b, c, d)
# 256 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# The precompuation + JIT-compiled function
%timeit func1_jit(a, b, c, d)
# 19.6 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


总是有一个对整个事物进行网格评估的选项:


Well there is always the option of grid-evaluating the whole thing:

import numpy as np
import scipy.special

def func1(a, b, c, d):
    ai, bi, ci, di, ei, fi = np.ogrid[:a, :b, :c, :d, :a + b - 2, :c + d - 2]
    # Compute
    B = (2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei)
    # Mask out of range elements for last two inner loops
    m = (ei < ai + bi) & (fi < ci + di)
    return np.sum(B * m)

print(func1(4, 6, 3, 4))
# 21769947.844726562

我使用了 scipy.special.因数 ,因为显然 np.factorial 由于某种原因不适用于数组.

I used scipy.special.factorial because apparently np.factorial does not work with arrays for some reason.

很显然,当您增加参数时,此方法的内存成本会非常快地增长.该代码实际上执行了不必要的计算,因为两个内部循环的迭代次数不同,因此(在此方法中)您必须使用最大的迭代,然后删除不需要的迭代.希望矢量化可以弥补这一点.一个小的IPython基准测试:

Obivously, the memory cost of this will grow very fast as you increment the parameters. The code actually performs more computations than necessary, because the two inner loops have varying number of iterations, so (in this method) you have to use the largest and then remove what you don't need. The hope is that vectorization will make up for that. A small IPython benchmark:

a, b, c, d = 4, 6, 3, 4
# func1_orig is the original loop-based version
%timeit func1_orig(a, b, c, d)
# 2.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# func1 here is the vectorized version
%timeit func1(a, b, c, d)
# 210 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


请注意,先前的方法也不是全有或全无.您可以选择仅对某些循环进行网格评估.例如,两个最里面的循环可以像这样向量化:

Note the previous approach is not an all-or-nothing thing either. You can choose to grid-evaluate only some of the loops. For example, the two innermost loops could be vectorized like this:

def func1(a, b, c, d):
    B = 0
    e = np.arange(a + b - 2).reshape((-1, 1))
    f = np.arange(c + d - 2)
    for ai in range(0, a):
        for bi in range(0, b):
            ei = e[:ai + bi]
            for ci in range(0, c):
                for di in range(0, d):
                    fi = f[:ci + di]
                    B += np.sum((2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei))
    return B

这仍然有循环,但是它确实避免了额外的计算,并且内存需求要低得多.哪一个最好取决于我猜输入的大小.在我的测试中,使用原始值(4、6、3、4),这甚至比原始函数还要慢;同样,在这种情况下,似乎在每个循环上为 ei fi 创建新的数组比对预先创建的数组的一个切片进行操作要快.但是,如果将输入乘以4(14、24、12、16),则它比原始输入(约x5)要快得多,尽管它仍然比完全矢量化的输入(约x3)要慢.另一方面,我可以用这个值(在5分钟内)将输入的值按10(40、60、30、40)的比例进行缩放,但由于内存的原因,我不能使用前一个值(因为我没有测试原始功能需要花很长时间).使用 @ numba.jit 会有一点帮助,尽管不是很有用(由于阶乘函数,不能使用 nopython ).您可以根据输入的大小尝试对更多或更少的循环进行矢量化处理.

This still has loops, but it does avoid extra computations, and the memory requirements are much lower. Which one is best depends on the sizes of the inputs I guess. In my tests, with the original values (4, 6, 3, 4) this is even slower than the original function; also, for that case it seems that creating new arrays for ei and fi on each loop is faster than operating on a slice of a pre-created one. However, if you multiply the input by 4 (14, 24, 12, 16) then this is much faster than the original (about x5), although still slower than the fully vectorized one (about x3). On the other hand, I could compute the value for the input scaled by ten (40, 60, 30, 40) with this one (in ~5min) but not with the previous one because of the memory (I didn't test how long it'd take with the original function). Using @numba.jit helps a bit, although not enormously (cannot use nopython due to the factorial function). You can experiment with vectorizing more or less loops depending on the size of your inputs.

这篇关于在python中向量化6 for循环累积总和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆