Numpy:单循环矢量化代码比两循环迭代慢 [英] Numpy: Single loop vectorized code slow compared to two loop iteration

查看:123
本文介绍了Numpy:单循环矢量化代码比两循环迭代慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

以下代码在两个数组的每个元素上迭代,以计算成对的欧式距离.

The following codes iterates over each element of two array to compute pairwise euclidean distance.

def compute_distances_two_loops(X, Y):
    num_test = X.shape[0]
    num_train = Y.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            dists[i][j] = np.sqrt(np.sum((X[i] - Y[j])**2))
    return dists

以下代码具有相同的目的,但具有单循环.

The following code serves the same purpose but with single loop.

def compute_distances_one_loop(X, Y):
    num_test = X.shape[0]
    num_train = Y.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        dists[i, :] = np.sqrt(np.sum((Y - X[i])**2, axis=1))
    return dists

下面是两者的时间比较.

Below are time comparison for both.

two_loop_time = time_function(compute_distances_two_loops, X, Y)
print ('Two loop version took %f seconds' % two_loop_time)

>> Two loop version took 20.3 seconds

one_loop_time = time_function(compute_distances_one_loop, X, Y)
print ('One loop version took %f seconds' % one_loop_time)

>> One loop version took 80.9 seconds

X和Y均为具有形状的numpy.ndarray-

Both X and Y are numpy.ndarray with shape -

X:(500,3000) Y:(5000,3000)

X: (500, 3000) Y: (5000, 3000)

出于直觉,结果不正确,单个循环至少应以相同的速度运行.我在这里想念什么?

Out of intuition the results are not correct, the single loop should run at least with same speed. What am I missing here ?

PS:结果并非来自单次运行.我在不同的机器上运行了多次代码,结果是相似的.

PS: The result is not from a single run. I ran the code number of times, on different machines, the results are similar.

推荐答案

原因是循环体内数组的大小. 在两个循环中,变体适用于3000个元素的两个数组.这很容易至少适合CPU的2级缓存,该缓存比主内存快得多,但也足够大,以至于计算距离比python循环迭代开销慢得多.

The reason is size of arrays within the loop body. In the two loop variant works on two arrays of 3000 elements. This easily fits into at least the level 2 cache of a cpu which is much faster than the main memory but it is also large enough that computing the distance is much slower than the python loop iteration overhead.

第二种情况下,循环体可处理5000 * 3000个元素.如此之大,以至于数据需要在每个计算步骤中进入主存储器(首先将Y-X [i]减法成一个临时数组,将该临时值平方成另一个临时值,然后将其读回以求和). 对于涉及的简单操作,主内存比cpu慢得多,因此即使删除了循环,它也要花费更长的时间. 您可以通过使用写入预分配的临时数组的就地操作来加快速度,但是对于这些数组大小,它仍然比两个循环的变体慢.

The second case the loop body works on 5000 * 3000 elements. This is so much that the data needs go to main memory in each computation step (first the Y-X[i] subtraction into a temporary array, squaring the temporary into another temporary and then read it back to sum it). The main memory is much slower than the cpu for the simple operations involved so it takes much longer despite removing a loop. You could speed it up a bit by using inplace operations writing into preallocated temporary array, but it will still be slower than the two loop variant for these array sizes.

请注意,scipy.spatial.distance.cdist(X, Y)可能将是最快的,因为它根本不需要任何临时工.

Note that scipy.spatial.distance.cdist(X, Y) is probably going to be the fastest, as it does not need any temporaries at all

这篇关于Numpy:单循环矢量化代码比两循环迭代慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆