Why is numpy sum 10 times slower than the + operator?


Problem description

I noticed that, very strangely, np.sum is about 10x slower than a hand-written sum.

Sum along an axis:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1.sum(axis=1)
%timeit test(p1)

186 µs ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

np.sum without axis:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1.sum()
%timeit test(p1)

17.9 µs ± 236 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

+:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1[:,0] + p1[:,1]
%timeit test(p1)

15.8 µs ± 328 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Multiplication:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1[:,0]*p1[:,1]
%timeit test(p1)

15.7 µs ± 701 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I don't see any reason for this. Any idea why? My numpy version is 1.15.3.
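
For reference, one can check that the hand-written variant really computes the same values as the axis reduction (a quick sanity check, up to floating-point round-off):

import numpy as np

p1 = np.random.rand(10000, 2)

# the hand-written column sum and the axis reduction agree
print(np.allclose(p1.sum(axis=1), p1[:, 0] + p1[:, 1]))  # True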

For 10000000 elements:

np.sum (with axis): 202 ms (5 x)
np.sum (without axis): 12 ms
+ : 46 ms (1 x)
* : 44.3 ms 

So I guess there is some overhead at play, to some extent...
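
A minimal sweep to see how that overhead amortizes with size (a hypothetical benchmark script; absolute times depend on machine and NumPy version):

import numpy as np
from timeit import timeit

for n in (10**3, 10**4, 10**7):
    p1 = np.random.rand(n, 2)
    reps = max(1, 10**6 // n)
    # average time per call for the reduction vs. the hand-written sum
    t_axis = timeit(lambda: p1.sum(axis=1), number=reps) / reps
    t_plus = timeit(lambda: p1[:, 0] + p1[:, 1], number=reps) / reps
    print(f"n={n}: sum(axis=1) {t_axis*1e6:8.1f} µs, + {t_plus*1e6:8.1f} µs")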

Recommended answer

The main difference is the larger overhead when a.sum(axis=1) is calculated. Calculating a reduction (in this case sum) is not a trivial matter:

  • one has to take round-off errors into account, and thus pairwise summation is used to reduce them (see the sketch after this list).
  • tiling is important for bigger arrays, as it makes the most of the available cache.
  • in order to use the SIMD instructions and out-of-order execution abilities of modern CPUs, one should calculate multiple rows in parallel.
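
To illustrate the first point, here is a pure-Python sketch of the pairwise idea (an illustration only - NumPy's actual C implementation also unrolls the innermost block and uses SIMD):

import numpy as np

def pairwise_sum(x, blocksize=8):
    # below the block size, sum naively from left to right
    if len(x) <= blocksize:
        s = 0.0
        for v in x:
            s += v
        return s
    # otherwise split in half and recurse; the round-off error then
    # grows like O(log n) instead of O(n) for the naive running sum
    half = len(x) // 2
    return pairwise_sum(x[:half], blocksize) + pairwise_sum(x[half:], blocksize)

x = np.random.rand(10**5)
print(pairwise_sum(x), x.sum())  # agree up to round-off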

I have discussed the topics above in more detail, for example here and here.

However, none of this is needed, and it is no better than a naive summation, if there are only two elements to add per row - you get the same result, but with much less overhead and thus faster.
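
In other words, for a handful of columns one can simply unroll the reduction by hand (a hypothetical helper, equivalent to the a[:,0]+a[:,1] variant above for two columns):

import numpy as np

def rowsum_unrolled(a):
    # one vectorized add per column instead of the generic
    # reduction machinery - fine when the column count is tiny
    out = a[:, 0].copy()
    for j in range(1, a.shape[1]):
        out += a[:, j]
    return out

a = np.random.rand(10000, 2)
print(np.allclose(rowsum_unrolled(a), a.sum(axis=1)))  # True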

For only 1000 elements, the overhead of calling numpy functionality is probably higher than actually doing those 1000 additions (or multiplications, for that matter, since on modern CPUs pipelined additions and multiplications have the same cost). As you can see, for 10^4 elements the running time is only about 2 times higher - a sure sign that overhead plays a bigger role for 10^3! In this answer the impact of overhead and cache misses is investigated in more detail.

Let's take a look at the profiler results to see whether the theory above holds (I use perf):

For a.sum(axis=1):

  17,39%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] reduce_loop
  11,41%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] pairwise_sum_DOUBLE
   9,78%  python   multiarray.cpython-36m-x86_64-linux-gnu.so  [.] npyiter_buffered_reduce_iternext_ite
   9,24%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] DOUBLE_add
   4,35%  python   python3.6                                   [.] _PyEval_EvalFrameDefault
   2,17%  python   multiarray.cpython-36m-x86_64-linux-gnu.so  [.] _aligned_strided_to_contig_size8_src
   2,17%  python   python3.6                                   [.] lookdict_unicode_nodummy
   ...

The overhead of using reduce_loop + pairwise_sum_DOUBLE dominates.

For a[:,0]+a[:,1]:

   7,24%  python   python3.6                                   [.] _PyEval_EvalF
   5,26%  python   python3.6                                   [.] PyObject_Mall
   3,95%  python   python3.6                                   [.] visit_decref
   3,95%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] DOUBLE_add
   2,63%  python   python3.6                                   [.] PyDict_SetDef
   2,63%  python   python3.6                                   [.] _PyTuple_Mayb
   2,63%  python   python3.6                                   [.] collect
   2,63%  python   python3.6                                   [.] fast_function
   2,63%  python   python3.6                                   [.] visit_reachab
   1,97%  python   python3.6                                   [.] _PyObject_Gen

As expected: the Python overhead plays a big role, and a simple DOUBLE_add is used.
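
If the allocation of the result array matters too, the standard ufunc out= parameter can reuse a preallocated buffer (a small sketch; the DOUBLE_add inner loop stays the same):

import numpy as np

a = np.random.rand(10000, 2)
buf = np.empty(a.shape[0])

# same element-wise addition, but writing into an existing buffer
# instead of allocating a fresh result array on every call
np.add(a[:, 0], a[:, 1], out=buf)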

When calling a.sum():

  • for once, reduce_loop isn't called for every row but only once, which means considerably less overhead.
  • no new resulting arrays are created, so there is no longer a need to write 1000 doubles to memory.

So it can be expected that a.sum() is faster (despite the fact that 2000 rather than 1000 additions must be made - but as we have seen, it is mostly about overhead and not the actual work - the additions aren't responsible for the big share of the running time).

The data was obtained by running:

perf record python run.py
perf report

# run.py
import numpy as np
a=np.random.rand(1000,2)

for _ in range(10000):
  a.sum(axis=1)
  #a[:,0]+a[:,1]
