Why is numpy sum 10 times slower than the + operator?


Problem description

I noticed that, very strangely, np.sum is about 10x slower than a hand-written sum.

Sum along an axis:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1.sum(axis=1)
%timeit test(p1)

186 µs ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

np.sum without axis:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1.sum()
%timeit test(p1)

17.9 µs ± 236 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

+:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1[:,0] + p1[:,1]
%timeit test(p1)

15.8 µs ± 328 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Multiplication:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1[:,0]*p1[:,1]
%timeit test(p1)

15.7 µs ± 701 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I don't see any reason for this. Any idea why? My numpy version is 1.15.3.
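
For reference, one can check that the hand-written variant really computes the same values as the axis reduction (a quick sanity check, up to floating-point round-off):

import numpy as np

p1 = np.random.rand(10000, 2)

# the hand-written column sum and the axis reduction agree
print(np.allclose(p1.sum(axis=1), p1[:, 0] + p1[:, 1]))  # True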

For 10000000 elements:

np.sum (with axis): 202 ms (5 x)
np.sum (without axis): 12 ms
+ : 46 ms (1 x)
* : 44.3 ms 

So I guess there is some overhead at play, to some extent...
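
A minimal sweep to see how that overhead amortizes with size (a hypothetical benchmark script; absolute times depend on machine and NumPy version):

import numpy as np
from timeit import timeit

for n in (10**3, 10**4, 10**7):
    p1 = np.random.rand(n, 2)
    reps = max(1, 10**6 // n)
    # average time per call for the reduction vs. the hand-written sum
    t_axis = timeit(lambda: p1.sum(axis=1), number=reps) / reps
    t_plus = timeit(lambda: p1[:, 0] + p1[:, 1], number=reps) / reps
    print(f"n={n}: sum(axis=1) {t_axis*1e6:8.1f} µs, + {t_plus*1e6:8.1f} µs")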

Recommended answer

The main difference is the larger overhead when a.sum(axis=1) is calculated. Calculating a reduction (in this case sum) is not a trivial matter:

  • one has to take round-off errors into account, and thus pairwise summation is used to reduce them (see the sketch after this list).
  • tiling is important for bigger arrays, as it makes the most of the available cache.
  • in order to use the SIMD instructions and out-of-order execution abilities of modern CPUs, one should calculate multiple rows in parallel.
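
To illustrate the first point, here is a pure-Python sketch of the pairwise idea (an illustration only - NumPy's actual C implementation also unrolls the innermost block and uses SIMD):

import numpy as np

def pairwise_sum(x, blocksize=8):
    # below the block size, sum naively from left to right
    if len(x) <= blocksize:
        s = 0.0
        for v in x:
            s += v
        return s
    # otherwise split in half and recurse; the round-off error then
    # grows like O(log n) instead of O(n) for the naive running sum
    half = len(x) // 2
    return pairwise_sum(x[:half], blocksize) + pairwise_sum(x[half:], blocksize)

x = np.random.rand(10**5)
print(pairwise_sum(x), x.sum())  # agree up to round-off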

I have discussed the topics above in more detail, for example here and here.

However, none of this is needed, and it is no better than a naive summation, if there are only two elements to add per row - you get the same result, but with much less overhead and thus faster.
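
In other words, for a handful of columns one can simply unroll the reduction by hand (a hypothetical helper, equivalent to the a[:,0]+a[:,1] variant above for two columns):

import numpy as np

def rowsum_unrolled(a):
    # one vectorized add per column instead of the generic
    # reduction machinery - fine when the column count is tiny
    out = a[:, 0].copy()
    for j in range(1, a.shape[1]):
        out += a[:, j]
    return out

a = np.random.rand(10000, 2)
print(np.allclose(rowsum_unrolled(a), a.sum(axis=1)))  # True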

For only 1000 elements, the overhead of calling numpy functionality is probably higher than actually doing those 1000 additions (or multiplications, for that matter, since on modern CPUs pipelined additions and multiplications have the same cost). As you can see, for 10^4 elements the running time is only about 2 times higher - a sure sign that overhead plays a bigger role for 10^3! In this answer the impact of overhead and cache misses is investigated in more detail.

Let's take a look at the profiler results to see whether the theory above holds (I use perf):

For a.sum(axis=1):

  17,39%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] reduce_loop
  11,41%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] pairwise_sum_DOUBLE
   9,78%  python   multiarray.cpython-36m-x86_64-linux-gnu.so  [.] npyiter_buffered_reduce_iternext_ite
   9,24%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] DOUBLE_add
   4,35%  python   python3.6                                   [.] _PyEval_EvalFrameDefault
   2,17%  python   multiarray.cpython-36m-x86_64-linux-gnu.so  [.] _aligned_strided_to_contig_size8_src
   2,17%  python   python3.6                                   [.] lookdict_unicode_nodummy
   ...

The overhead of using reduce_loop + pairwise_sum_DOUBLE dominates.

For a[:,0]+a[:,1]:

   7,24%  python   python3.6                                   [.] _PyEval_EvalF
   5,26%  python   python3.6                                   [.] PyObject_Mall
   3,95%  python   python3.6                                   [.] visit_decref
   3,95%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] DOUBLE_add
   2,63%  python   python3.6                                   [.] PyDict_SetDef
   2,63%  python   python3.6                                   [.] _PyTuple_Mayb
   2,63%  python   python3.6                                   [.] collect
   2,63%  python   python3.6                                   [.] fast_function
   2,63%  python   python3.6                                   [.] visit_reachab
   1,97%  python   python3.6                                   [.] _PyObject_Gen

As expected: the Python overhead plays a big role, and a simple DOUBLE_add is used.
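
If the allocation of the result array matters too, the standard ufunc out= parameter can reuse a preallocated buffer (a small sketch; the DOUBLE_add inner loop stays the same):

import numpy as np

a = np.random.rand(10000, 2)
buf = np.empty(a.shape[0])

# same element-wise addition, but writing into an existing buffer
# instead of allocating a fresh result array on every call
np.add(a[:, 0], a[:, 1], out=buf)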

When calling a.sum():

  • for once, reduce_loop isn't called for every row but only once, which means considerably less overhead.
  • no new resulting arrays are created, so there is no longer a need to write 1000 doubles to memory.

So it can be expected that a.sum() is faster (despite the fact that 2000 rather than 1000 additions must be made - but as we have seen, it is mostly about overhead and not the actual work - the additions aren't responsible for the big share of the running time).

The data was obtained by running:

perf record python run.py
perf report

# run.py
import numpy as np
a=np.random.rand(1000,2)

for _ in range(10000):
  a.sum(axis=1)
  #a[:,0]+a[:,1]
