numpy: efficient, large dot products



I am trying to perform a large linear-algebra computation to transform a generic covariance matrix KK_l_obs (shape (NL, NL)) into a map of covariance matrices in a reduced space, Kmap_PC (shape (q, q, X, Y)).

Information about how to construct Kmap_PC for each spatial location is held in other arrays a, I0, and k_l_th. The first two have shape (X, Y), and the third (nl, nl). The transformation between the observed and reduced space is handled by the eigenvectors E (shape (q, nl)). Note that NL > nl.

A spatial element of Kmap_PC is calculated as:

Kmap_PC[..., X, Y] = E.dot(
    KK_l_obs[I0[X, Y]: I0[X, Y] + nl,
             I0[X, Y]: I0[X, Y] + nl] / a_map[X, Y] + \
    k_l_th).dot(E.T)

The bit inside the first dot product could theoretically be computed directly with np.einsum, but it would take up hundreds of GB of memory. What I am doing now is looping over the spatial indices of Kmap_PC, which is pretty slow. I could also distribute the calculation using MPI (which would probably give a 3-4x speedup, since I have 16 cores available).

I'm wondering:

(a) if I can do the computation more efficiently--perhaps explicitly breaking it down into groups of spatial elements; and

(b) if I can improve the memory overhead for those calculations.

Code snippet

import numpy as np
np.random.seed(1)

X = 10
Y = 10
NL = 3000
nl = 1000
q = 7

a_map = 5. * np.random.rand(X, Y)
E = np.random.randn(q, nl)

# construct constant component
m1_ = .05 * np.random.rand(nl, nl)
k_l_th = m1_.dot(m1_)

# construct variable component
m2_ = np.random.rand(NL, NL)
KK_l_obs = m2_.dot(m2_.T)

# where to start in big cov
I0 = np.random.randint(0, NL - nl, (X, Y))

# the slow way
def looping():
    K_PC = np.empty((q, q, X, Y))
    inds = np.ndindex((X, Y))

    for si in inds:
        I0_ = I0[si[0], si[1]]
        K_PC[..., si[0], si[1]] = E.dot(
            KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] / a_map[si[0], si[1]] + k_l_th).dot(E.T)

    return K_PC

def veccalc():
    nl_ = np.arange(nl)[..., None, None]
    I, J = np.meshgrid(nl_, nl_)

    K_s = KK_l_obs[I0[..., None, None] + J, I0[..., None, None] + I]
    K_s = K_s / a_map[..., None, None] + k_l_th[None, None, ...]
    print(K_s.nbytes)

    K_PC = E @ K_s @ E.T
    K_PC = np.moveaxis(K_PC, [0, 1], [-2, -1])

    return K_PC
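For scale, the intermediate K_s built inside veccalc has shape (X, Y, nl, nl) in float64, which is what the print(K_s.nbytes) line reports. A quick back-of-the-envelope check of that footprint (my illustration using the sample sizes, not part of the original post) shows why the fully vectorized route blows up for realistic X, Y, nl:

```python
# Footprint of the fancy-indexed intermediate K_s in veccalc:
# shape (X, Y, nl, nl), dtype float64 (8 bytes per element).
X, Y, nl = 10, 10, 1000
nbytes = X * Y * nl * nl * 8
print(nbytes / 1e9, "GB")  # 0.8 GB even at these toy sizes
```

Scaling X, Y, or nl up by modest factors pushes this into the hundreds of GB mentioned in the question.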

Solution

Tweak #1

One very simple, often-overlooked performance tweak in NumPy is to avoid division and use multiplication instead. The difference is negligible for scalar-to-scalar divisions, or for array-to-array divisions between equal-shaped arrays. But NumPy's implicit broadcasting makes it worthwhile for divisions that broadcast between arrays of different shapes, or between an array and a scalar. For those cases, we can get a noticeable boost by multiplying with the reciprocal. Thus, for the stated problem, we pre-compute the reciprocal of a_map and use it for multiplication in place of division.

So, at the start do :

r_a_map = 1.0/a_map

Then, within the nested loops, use it as :

KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] * r_a_map[si[0], si[1]]
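As a minimal sanity check (mine, not from the answer), multiplying by a precomputed reciprocal matches the division to floating-point tolerance:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1000, 1000))  # stand-in for a KK_l_obs slice
s = 3.7                       # stand-in for the scalar a_map[si[0], si[1]]

r = 1.0 / s                   # precomputed reciprocal
out_div = A / s               # broadcast division (slower path)
out_mul = A * r               # broadcast multiplication (faster path)

print(np.allclose(out_div, out_mul))  # True
```

Note that the two can differ in the last few bits, hence np.allclose rather than exact equality.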

Tweak #2

We can use the distributive property of matrix multiplication there:

A*(B + C) = A*B + A*C

Thus, k_l_th, which is added in every iteration but stays constant, can be taken outside the loop and accounted for after exiting the nested loops. Its effective contribution is E.dot(k_l_th).dot(E.T), so we add that to K_PC once at the end.
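A small numerical check of this identity in the sandwiched form used here, E.(B + C).E^T = E.B.E^T + E.C.E^T (shapes chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
q_, n_ = 4, 50                       # small stand-ins for q and nl
E_ = rng.standard_normal((q_, n_))   # stand-in for E
B = rng.standard_normal((n_, n_))    # per-pixel term: scaled KK_l_obs slice
C = rng.standard_normal((n_, n_))    # constant term: k_l_th

lhs = E_ @ (B + C) @ E_.T            # what the loop computes per pixel
rhs = E_ @ B @ E_.T + E_ @ C @ E_.T  # constant part hoisted out of the loop
print(np.allclose(lhs, rhs))  # True
```

Hoisting the constant part replaces X*Y dense (q, nl) x (nl, nl) products involving k_l_th with a single one.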


Finalizing and benchmarking

Using tweak #1 and tweak #2, we end up with a modified approach, like so -

def original_mod_app():
    r_a_map = 1.0/a_map
    K_PC = np.empty((q, q, X, Y))
    inds = np.ndindex((X, Y))
    for si in inds:
        I0_ = I0[si[0], si[1]]
        K_PC[..., si[0], si[1]] = E.dot(
            KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] * \
            r_a_map[si[0], si[1]]).dot(E.T)
    return K_PC + E.dot(k_l_th).dot(E.T)[:,:,None,None]

Runtime test with the same sample setup as used in the question (original_app here is the looping approach from the question) -

In [458]: %timeit original_app()
1 loops, best of 3: 1.4 s per loop

In [459]: %timeit original_mod_app()
1 loops, best of 3: 677 ms per loop

In [460]: np.allclose(original_app(), original_mod_app())
Out[460]: True

So, we are getting a speedup of 2x+ there.
