Why are np.hypot and np.subtract.outer very fast compared to vanilla broadcasting? Using Numba to speed up a numpy distance matrix calculation in parallel


Problem description

I have two large sets of 2D points and I need to calculate a distance matrix.

I needed it to be fast and in python, so obviously I used numpy. I recently learned about numpy broadcasting and used it, so that rather than looping in python, numpy does the work in C.

I really thought broadcasting was all I needed, until I saw other methods working better than vanilla broadcasting. I have two ways of calculating the distance matrix, and I don't understand why one is better than the other.

I looked up https://github.com/numpy/numpy/issues/14761 and I have contradictory results.

Below are the two ways of calculating the distance matrix.

Cells [3, 4, 6] and [8, 9] both calculate the distance matrix, but cells 3+4, which use subtract.outer, are way faster than cell 8, which uses vanilla broadcasting, and cell 6, which uses hypot, is way faster than cell 9, which is the straightforward way. I did not try a python loop, assuming it would never finish.

I would like to know:

1. Is there a faster way to calculate the distance matrix (maybe scikit-learn or scipy)?

2. Why are hypot and subtract.outer so fast?
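Regarding the first question, one commonly suggested alternative (a sketch, assuming scipy is installed) is `scipy.spatial.distance.cdist`, which computes the whole pairwise Euclidean distance matrix in one C-level call. The shapes below are shrunk so the check against the vanilla-broadcast version runs quickly:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
obs = rng.random((500, 2))      # smaller shapes than the question, for a quick check
interp = rng.random((300, 2))

# One call computes the full (500, 300) Euclidean distance matrix.
h_cdist = cdist(obs, interp)    # metric='euclidean' is the default

# Reference: the vanilla-broadcast version from the question.
d = obs[:, np.newaxis, :] - interp
h_ref = np.sqrt(d[:, :, 0]**2 + d[:, :, 1]**2)

print(np.allclose(h_cdist, h_ref))  # True
```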

For convenience I have also attached a snippet to run the whole thing, and I change the seed to prevent cache reuse.

### Cell 1
import numpy as np

np.random.seed(858442)

### Cell 2
%%time
obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))

CPU times: user 2.02 ms, sys: 1.4 ms, total: 3.42 ms
Wall time: 1.84 ms

### Cell 3
%%time
d0 = np.subtract.outer(obs[:,0], interp[:,0])

CPU times: user 2.46 s, sys: 1.97 s, total: 4.42 s
Wall time: 4.42 s

### Cell 4
%%time
d1 = np.subtract.outer(obs[:,1], interp[:,1])

CPU times: user 3.1 s, sys: 2.7 s, total: 5.8 s
Wall time: 8.34 s

### Cell 5
%%time
h = np.hypot(d0, d1)

CPU times: user 12.7 s, sys: 24.6 s, total: 37.3 s
Wall time: 1min 6s

### Cell 6
np.random.seed(773228)

### Cell 7
%%time
obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))

CPU times: user 1.84 ms, sys: 1.56 ms, total: 3.4 ms
Wall time: 2.03 ms

### Cell 8
%%time
d = obs[:, np.newaxis, :] - interp
d0, d1 = d[:, :, 0], d[:, :, 1]

CPU times: user 22.7 s, sys: 8.24 s, total: 30.9 s
Wall time: 33.2 s

### Cell 9
%%time
h = np.sqrt(d0**2 + d1**2)

CPU times: user 29.1 s, sys: 2min 12s, total: 2min 41s
Wall time: 6min 10s

Update, thanks to Jérôme Richard below:

  • Stackoverflow never disappoints
  • There is a faster way, using numba
  • It has a just-in-time compiler that converts a python snippet into fast machine code; the first call will be a little slower than subsequent ones because it compiles. But even on the first call, for a (49000, 12000) matrix, njit parallel beats hypot + subtract.outer by a 9x margin
    • Make sure to use a different seed each time the script is run
    import sys
    import time
    
    import numba as nb
    import numpy as np
    
    np.random.seed(int(sys.argv[1]))
    
    d0 = np.random.random((49000, 2))
    d1 = np.random.random((12000, 2))
    
    # Eager compilation, without parallel execution (prange falls back to range)
    @nb.njit((nb.float64[:, :], nb.float64[:, :]))
    def f1(d0, d1):
        print('Numba without parallel')
        res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
        for i in nb.prange(d0.shape[0]):
            for j in range(d1.shape[0]):
                res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
        return res
    
    # Add eager compilation, compiles before hand
    @nb.njit((nb.float64[:, :], nb.float64[:, :]), parallel=True)
    def f2(d0, d1):
        print('Numba with parallel')
        res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
        for i in nb.prange(d0.shape[0]):
            for j in range(d1.shape[0]):
                res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
        return res
    
    def f3(d0, d1):
        print('hypot + subtract.outer')
        return np.hypot(
            np.subtract.outer(d0[:,0], d1[:,0]),
            np.subtract.outer(d0[:,1], d1[:,1])
        )
    
    if __name__ == '__main__':
        s1 = time.time()
        eval(f'{sys.argv[2]}(d0, d1)')
        print(time.time() - s1)
    

    (base) ~/xx@xx:~/xx$ python3 test.py 523432 f3
    hypot + subtract.outer
    9.79756784439087
    (base) xx@xx:~/xx$ python3 test.py 213622 f2
    Numba with parallel
    0.3393140316009521
    

    I will update this post with further developments if I find a faster method.

Recommended answer

First of all, d0 and d1 each take 50000 x 30000 x 8 = 12 GB, which is pretty big. Make sure you have more than 100 GB of memory, because that is what the whole script requires! This is a huge amount of memory. If you do not have enough, the operating system will use a storage device (e.g. swap) to store the excess data, which is much slower. Actually, there is no reason for Cell 4 to be slower than Cell 3, and I guess you already do not have enough memory to (fully) store d1 in RAM, while d0 seems to fit (mostly) in memory. There is no difference on my machine when both fit in RAM (one can also reverse the order of the operations to check this). This also explains why the subsequent operations tend to get slower.
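The 12 GB figure can be checked without allocating anything: each intermediate is a 50000 x 30000 array of float64, at 8 bytes per element:

```python
import numpy as np

itemsize = np.dtype(np.float64).itemsize   # 8 bytes per element
nbytes = 50000 * 30000 * itemsize          # size of ONE intermediate array
print(nbytes / 1e9)                        # 12.0 (GB)
```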

That being said, Cells 8+9 are also slower because they create temporary arrays and need more memory passes to compute the result than Cells 3+4+5. Indeed, the expression np.sqrt(d0**2 + d1**2) first computes d0**2, producing a new 12 GB temporary array, then computes d1**2, producing another 12 GB temporary array, then sums the two temporaries, producing yet another new 12 GB array, and finally computes the square root, producing one more 12 GB temporary array. This can require up to 48 GB of memory and needs 4 read-write memory-bound passes. It is not efficient and does not use the CPU/RAM (e.g. CPU caches) effectively.
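Even without Numba, the temporaries in `np.sqrt(d0**2 + d1**2)` can be avoided by reusing buffers via the ufunc `out=` parameter. A sketch with shrunken shapes (note it deliberately overwrites d0 and d1, which is fine when they are themselves throwaway intermediates):

```python
import numpy as np

rng = np.random.default_rng(0)
d0 = rng.random((500, 300))
d1 = rng.random((500, 300))

# Reference: allocates a fresh temporary at every step.
h_ref = np.sqrt(d0**2 + d1**2)

# Buffer-reusing version: every ufunc writes into an existing array.
np.multiply(d0, d0, out=d0)   # d0 <- d0**2
np.multiply(d1, d1, out=d1)   # d1 <- d1**2
np.add(d0, d1, out=d0)        # d0 <- d0 + d1
h = np.sqrt(d0, out=d0)       # d0 <- sqrt(d0); h aliases d0

print(np.allclose(h, h_ref))  # True
```

This removes the extra allocations, but still makes several passes over memory; the one-pass Numba version below goes further.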

There is a much faster implementation, consisting in doing the whole computation in 1 pass and in parallel using Numba's JIT. Here is an example:

    import numba as nb
    import numpy as np

    @nb.njit(parallel=True)
    def distanceMatrix(a, b):
        res = np.empty((a.shape[0], b.shape[0]), dtype=a.dtype)
        for i in nb.prange(a.shape[0]):
            for j in range(b.shape[0]):
                res[i, j] = np.sqrt((a[i, 0] - b[j, 0])**2 + (a[i, 1] - b[j, 1])**2)
        return res
    

This implementation uses 3 times less memory (only 12 GB) and is much faster than the one using subtract.outer. Indeed, due to swapping, Cells 3+4+5 take a few minutes while this one takes 1.3 seconds!

The takeaway is that memory accesses are expensive, and so are temporary arrays. Avoid making multiple passes over memory when working on huge buffers, and take advantage of CPU caches when the computation is not trivial (for example by using array chunks).
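The chunking idea from the last sentence can be sketched in pure numpy: compute the distance matrix one row-block at a time so each intermediate stays small (the block size here is an assumption to tune for your cache):

```python
import numpy as np

def distance_matrix_chunked(a, b, chunk=256):
    # Pairwise Euclidean distances, one row-block at a time: each iteration
    # only materialises a (chunk, len(b), 2) intermediate, so temporary
    # memory stays bounded regardless of len(a).
    res = np.empty((a.shape[0], b.shape[0]), dtype=a.dtype)
    for start in range(0, a.shape[0], chunk):
        stop = min(start + chunk, a.shape[0])
        diff = a[start:stop, np.newaxis, :] - b        # (block, len(b), 2)
        res[start:stop] = np.sqrt((diff ** 2).sum(axis=-1))
    return res

rng = np.random.default_rng(0)
a = rng.random((1000, 2))
b = rng.random((700, 2))

# Check against the hypot + subtract.outer version from the question.
full = np.hypot(np.subtract.outer(a[:, 0], b[:, 0]),
                np.subtract.outer(a[:, 1], b[:, 1]))
print(np.allclose(distance_matrix_chunked(a, b), full))  # True
```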
