Speeding up distance matrix computation with NumPy and Cython

Question

Consider a numpy array A of shape NxM. The goal is to compute the Euclidean distance matrix D, where each element D[i,j] is the Euclidean distance between rows i and j of A. What is the fastest way of doing it? This is not exactly the problem I need to solve, but it's a good example of what I'm trying to do (in general, other distance metrics could be used).

This is the fastest I could come up with so far:

n = A.shape[0]
D = np.empty((n,n))
for i in range(n):
    D[i] = np.sqrt(np.square(A-A[i]).sum(1))

But is it the fastest way? I'm mainly concerned about the for loop. Can we beat this with, say, Cython?

To avoid looping, I tried to use broadcasting, and do something like this:

D = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))

But it turned out to be a bad idea: constructing the intermediate 3D array of shape NxNxM adds overhead, so the performance is worse.

I tried Cython, but I'm new to it, so I don't know how good my attempt is:

import numpy as np
cimport numpy as np

def dist(np.ndarray[np.int32_t, ndim=2] A):
    cdef int n = A.shape[0]
    cdef np.ndarray[np.float64_t, ndim=2] dm = np.empty((n, n), dtype=np.float64)
    cdef int i
    for i in range(n):
        # each iteration still goes through vectorized numpy calls (Python API)
        dm[i] = np.sqrt(np.square(A - A[i]).sum(1)).astype(np.float64)
    return dm

The above code was a bit slower than the Python for loop. I don't know much about Cython, but I assumed I could achieve at least the same performance as the for loop + numpy. Is it possible to get a noticeable performance improvement when this is done the right way? Or is there some other way to speed it up (not involving parallel computation)?

Recommended answer

The key thing with Cython is to avoid using Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.

The Cython documentation's tutorial on working with NumPy is very useful: it covers the process of converting numpy code to Cython and optimizing it.

Here's a quick stab at a more optimized Cython version of your distance function:

import numpy as np
cimport numpy as np
cimport cython

# don't use np.sqrt - the sqrt function from the C standard library is much
# faster
from libc.math cimport sqrt

# disable checks that ensure that array indices don't go out of bounds. this is
# faster, but you'll get a segfault if you mess up your indexing.
@cython.boundscheck(False)
# this disables 'wraparound' indexing from the end of the array using negative
# indices.
@cython.wraparound(False)
def dist(double [:, :] A):

    # declare C types for as many of our variables as possible. note that we
    # don't necessarily need to assign a value to them at declaration time.
    cdef:
        # Py_ssize_t is just a special platform-specific type for indices
        Py_ssize_t nrow = A.shape[0]
        Py_ssize_t ncol = A.shape[1]
        Py_ssize_t ii, jj, kk

        # this line is particularly expensive, since creating a numpy array
        # involves unavoidable Python API overhead
        np.ndarray[np.float64_t, ndim=2] D = np.zeros((nrow, nrow), np.double)

        double tmpss, diff

    # another advantage of using Cython rather than broadcasting is that we can
    # exploit the symmetry of D by only looping over its upper triangle
    for ii in range(nrow):
        for jj in range(ii + 1, nrow):
            # we use tmpss to accumulate the SSD over each pair of rows
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss += diff * diff
            tmpss = sqrt(tmpss)
            D[ii, jj] = tmpss
            D[jj, ii] = tmpss  # because D is symmetric

    return D
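
For readers without a Cython toolchain at hand, the same upper-triangle logic can be sketched in plain Python. This is not part of the original answer, and dist_reference is a hypothetical name. It will be far slower than either numpy version, so it is only meant for stepping through the algorithm or checking small inputs.

```python
import numpy as np


def dist_reference(A):
    """Plain-Python mirror of the Cython loops (hypothetical helper, slow)."""
    A = np.asarray(A, dtype=np.float64)
    nrow, ncol = A.shape
    D = np.zeros((nrow, nrow), dtype=np.float64)
    for ii in range(nrow):
        # only visit the upper triangle; the diagonal stays at zero
        for jj in range(ii + 1, nrow):
            tmpss = 0.0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss += diff * diff
            # mirror the value into the lower triangle, since D is symmetric
            D[ii, jj] = D[jj, ii] = np.sqrt(tmpss)
    return D


# three points on a line through the origin, spaced 5 apart
A_small = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(dist_reference(A_small))
```

With this input, rows 0 and 1 are 5 apart, rows 1 and 2 are 5 apart, and rows 0 and 2 are 10 apart, which makes the symmetry of D easy to see.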

I saved this in a file called fastdist.pyx. We can use pyximport to simplify the build process:

import pyximport
pyximport.install()
import fastdist
import numpy as np

A = np.random.randn(100, 200)

D1 = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
D2 = fastdist.dist(A)

print(np.allclose(D1, D2))
# True

So it works, at least. Let's do some benchmarking using the %timeit magic:

%timeit np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
# 100 loops, best of 3: 10.6 ms per loop

%timeit fastdist.dist(A)
# 100 loops, best of 3: 1.21 ms per loop

A ~9x speed-up is nice, but not really a game-changer. As you said, though, the big problem with the broadcasting approach is the memory requirements of constructing the intermediate array.
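
To put a number on that memory requirement, here is a rough estimate of the size of the NxNxM float64 intermediate. The helper name broadcast_intermediate_bytes is made up for this sketch, and the figure ignores any further temporaries numpy may allocate.

```python
import numpy as np


def broadcast_intermediate_bytes(n_rows, n_cols, dtype=np.float64):
    """Bytes needed for the (N, N, M) difference array (hypothetical helper)."""
    return n_rows * n_rows * n_cols * np.dtype(dtype).itemsize


small = broadcast_intermediate_bytes(100, 200)    # the 100 x 200 benchmark input
large = broadcast_intermediate_bytes(1000, 2000)  # a 1000 x 2000 input
print(small / 1e6, "MB")
print(large / 1e9, "GB")
```

That is about 16 MB for the 100 x 200 input but about 16 GB for a 1000 x 2000 one, whereas the output matrix D itself only needs N x N x 8 bytes (8 MB for N = 1000).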

A2 = np.random.randn(1000, 2000)
%timeit fastdist.dist(A2)
# 1 loops, best of 3: 1.36 s per loop

I wouldn't recommend trying that using broadcasting...
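
If you want to stay in pure numpy, one possible alternative (not from the original answer, and not benchmarked here) is to expand the squared distance as ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, which needs only NxN temporaries instead of the NxNxM array. The function name dist_gram is hypothetical. Note that this formulation can lose precision when two rows are nearly identical, so treat it as a sketch rather than a drop-in replacement.

```python
import numpy as np


def dist_gram(A):
    """Pairwise Euclidean distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b."""
    A = np.asarray(A, dtype=np.float64)
    sq_norms = np.einsum("ij,ij->i", A, A)  # squared norm of each row
    sq_dists = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2.0 * (A @ A.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from rounding
    np.fill_diagonal(sq_dists, 0.0)          # a row is at distance 0 from itself
    return np.sqrt(sq_dists)


# compare against the broadcasting version on a small input
rng = np.random.default_rng(0)
A_check = rng.standard_normal((50, 20))
D_broadcast = np.sqrt(
    np.square(A_check[np.newaxis, :, :] - A_check[:, np.newaxis, :]).sum(2)
)
print(np.allclose(D_broadcast, dist_gram(A_check)))
```

The matrix product A @ A.T does most of the work, and numpy normally hands it to an optimized BLAS routine. Unlike the Cython loop, this trick is specific to Euclidean distance and does not carry over to arbitrary metrics.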

Another thing we could do is parallelize this over the outermost loop, using the prange function:

from cython.parallel cimport prange

...

for ii in prange(nrow, nogil=True, schedule='guided'):
...

In order to compile the parallel version you'll need to tell the compiler to enable OpenMP. I haven't figured out how to do this using pyximport, but if you're using gcc you could compile it manually like this:

$ cython fastdist.pyx
$ gcc -shared -pthread -fPIC -fwrapv -fopenmp -O3 \
   -Wall -fno-strict-aliasing  -I/usr/include/python2.7 -o fastdist.so fastdist.c

With parallelism, using 8 threads, the same 1000 x 2000 input runs about 2.7x faster than the 1.36 s serial version:

%timeit D2 = fastdist.dist_parallel(A2)
# 1 loops, best of 3: 509 ms per loop
