Why python broadcasting in the example below is slower than a simple loop?

Problem description

I have an array of vectors and compute the norm of their diffs vs the first one. When using python broadcasting, the calculation is significantly slower than doing it via a simple loop. Why?

import numpy as np

def norm_loop(M, v):
  n = M.shape[0]
  d = np.zeros(n)
  for i in range(n):
    d[i] = np.sum((M[i] - v)**2)
  return d

def norm_bcast(M, v):
  n = M.shape[0]
  d = np.zeros(n)
  d = np.sum((M - v)**2, axis=1)
  return d

M = np.random.random_sample((1000, 10000))
v = M[0]

%timeit norm_loop(M, v) 
25.9 ms

%timeit norm_bcast(M, v)
38.5 ms

I have Python 3.6.3 and Numpy 1.14.2

To run the example in google colab: https://drive.google.com/file/d/1GKzpLGSqz9eScHYFAuT8wJt4UIZ3ZTru/view?usp=sharing

Answer

Memory access.

First off, the broadcast version can be simplified to

def norm_bcast(M, v):
    return np.sum((M - v)**2, axis=1)

This still runs slightly slower than the looped version. Now, conventional wisdom says that vectorized code using broadcasting should always be faster, which in many cases isn't true (I'll shamelessly plug another of my answers here). So what's happening?

As I said, it comes down to memory access.

In the broadcast version, v is subtracted from every row of M. By the time the last row of M is processed, the results of processing the first row have already been evicted from cache, so in the second step these differences are loaded into cache again and squared. Finally, they are loaded and processed a third time for the summation. Since M is quite large, parts of the cache are cleared at each step to accommodate all of the data.

In the looped version each row is processed completely in one smaller step, leading to fewer cache misses and overall faster code.
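The same cache-friendliness can be combined with vectorization by processing M in blocks of rows, so each intermediate array is small enough to stay cache-resident between the subtract, square, and sum steps. This is a sketch, not part of the original answer; the default chunk size is an arbitrary assumption and would need tuning per machine:

```python
import numpy as np

def norm_chunked(M, v, chunk=128):
    """Like norm_bcast, but over `chunk` rows at a time so the
    intermediate (M[i:j] - v)**2 array stays small."""
    n = M.shape[0]
    d = np.empty(n)
    for i in range(0, n, chunk):
        j = min(i + chunk, n)
        # Each block's temporaries fit in cache, so the squared
        # differences are still hot when the row sums are taken.
        d[i:j] = np.sum((M[i:j] - v)**2, axis=1)
    return d
```

Whether this beats the plain loop depends on the row length and cache sizes of the machine, so it is worth benchmarking with `%timeit` before adopting it.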

Lastly, it is possible to avoid this with some array operations by using einsum. This function allows mixing matrix multiplications and summations. First, I'll point out that its syntax is rather unintuitive compared to the rest of numpy, and the potential improvements often aren't worth the extra effort of understanding it. The answer may also differ slightly due to rounding errors. In this case it can be written as

def norm_einsum(M, v):
    tmp = M-v
    return np.einsum('ij,ij->i', tmp, tmp)

This reduces it to two operations over the entire array - a subtraction, and calling einsum, which performs the squaring and summation. This gives a slight improvement:

%timeit norm_bcast(M, v)
30.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit norm_loop(M, v)
25.1 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit norm_einsum(M, v)
21.7 ms ± 65.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
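Since rounding errors were mentioned above, a quick `np.allclose` sanity check confirms that all three versions agree within floating-point tolerance. This is a minimal self-contained check using the function definitions from this post:

```python
import numpy as np

def norm_loop(M, v):
    n = M.shape[0]
    d = np.zeros(n)
    for i in range(n):
        d[i] = np.sum((M[i] - v)**2)
    return d

def norm_bcast(M, v):
    return np.sum((M - v)**2, axis=1)

def norm_einsum(M, v):
    tmp = M - v
    return np.einsum('ij,ij->i', tmp, tmp)

M = np.random.random_sample((100, 50))
v = M[0]

# All three agree to within the default allclose tolerances.
print(np.allclose(norm_loop(M, v), norm_bcast(M, v)))   # True
print(np.allclose(norm_loop(M, v), norm_einsum(M, v)))  # True
```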
