Numpy in-place operation performance


Question

I was comparing NumPy array in-place operations with regular operations. Here is what I did (Python version 3.7.3):

    import numpy as np

    a1, a2 = np.random.random((10,10)), np.random.random((10,10))

For comparison:

    def func1(a1, a2):
        a1 = a1 + a2

    def func2(a1, a2):
        a1 += a2

    %timeit func1(a1, a2)
    %timeit func2(a1, a2)

Because the in-place operation avoids allocating memory on each call, I was expecting func1 to be slower than func2.

But I got this:

In [10]: %timeit func1(a1, a2)
595 ns ± 14.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [11]: %timeit func2(a1, a2)
1.38 µs ± 7.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [12]: np.__version__
Out[12]: '1.16.2'

This suggests func1 takes only half the time of func2. Can anyone explain why this is the case?

Answer

I found this very intriguing and decided to time this myself. But instead of just checking 10x10 arrays, I tested a lot of different array sizes with NumPy 1.16.2 (the plot was produced with the simple_benchmark code at the end of this answer):

This clearly shows that for small array sizes the normal addition is faster, and only for moderately large array sizes is the in-place operation faster. There is also a weird bump around 100,000 elements that I cannot explain (it's close to the page size on my computer; maybe a different allocation scheme is used there).

Allocating a temporary array is expected to be slower because:

  • the memory has to be allocated
  • you have to iterate over 3 arrays to perform the operation instead of 2 (see the sketch below)
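
As a rough illustration of the second point, here is a minimal pure-Python sketch of the per-element work (an analogy only; NumPy's actual loops run in C):

    def regular_add(a1, a2):
        out = [0.0] * len(a1)     # extra allocation for the result
        for i in range(len(a1)):  # touches 3 arrays: reads a1 and a2, writes out
            out[i] = a1[i] + a2[i]
        return out

    def inplace_add(a1, a2):
        for i in range(len(a1)):  # touches 2 arrays: reads a2, reads/writes a1
            a1[i] += a2[i]
        return a1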

Especially the first point (allocating the memory) is probably not accounted for in the benchmark (neither with %timeit nor with simple_benchmark.run). That's because requesting the same memory size over and over again is probably something that is very optimized, which makes the addition with an extra array seem a bit faster than it actually is.
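
A quick way to see the difference in allocation behavior (a small check added for illustration, not part of the original benchmark):

    import numpy as np

    a1, a2 = np.random.random((10, 10)), np.random.random((10, 10))

    res = a1 + a2        # regular addition: a brand-new result array is allocated
    print(res is a1)     # False: a1 is left untouched

    before = a1
    a1 += a2             # in-place addition: writes into a1's existing buffer
    print(a1 is before)  # True: no new array was allocated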

Another point to mention here is that in-place addition probably has a higher constant factor. If you're doing an in-place addition, you have to do more code checks before you can perform the operation, for example for overlapping inputs. That could give in-place addition a higher constant factor.
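
For instance, two views of the same buffer can overlap, and the in-place machinery has to cope with that. A small illustration (recent NumPy versions detect such overlaps and resolve them internally, e.g. by copying):

    import numpy as np

    a = np.arange(8)
    left, right = a[:-1], a[1:]           # two overlapping views of one buffer
    print(np.shares_memory(left, right))  # True: the ufunc has to check for this

    left += right  # NumPy detects the overlap and still produces a correct result
    print(a)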

As more general advice: micro-benchmarks can be helpful, but they are not always really accurate. You should also benchmark the code that calls the operation to make more educated statements about the actual performance of your code. Often such micro-benchmarks hit some highly optimized cases (for example, repeatedly allocating the same amount of memory and releasing it again) that wouldn't happen (as often) when the code is actually used.
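
One way to bring the allocation pattern closer to real usage is to time a caller that also creates fresh arrays each run. A hedged sketch using the standard timeit module (the array creation will dominate the absolute numbers, so treat the results qualitatively):

    import timeit

    # Micro-benchmark: the same pre-allocated arrays are reused every iteration.
    setup = "import numpy as np; a1, a2 = np.random.random((10, 10)), np.random.random((10, 10))"
    print(timeit.timeit("a1 + a2", setup=setup, number=100_000))

    # Caller-style benchmark: fresh arrays each iteration, so the allocator
    # is exercised more like it would be in real code.
    stmt = "x, y = np.random.random((10, 10)), np.random.random((10, 10)); x + y"
    print(timeit.timeit(stmt, setup="import numpy as np", number=100_000))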

Here is also the code I used for the graph, using my library simple_benchmark:

from simple_benchmark import BenchmarkBuilder, MultiArgument
import numpy as np

b = BenchmarkBuilder()

@b.add_function()
def func1(a1, a2):
    a1 = a1 + a2

@b.add_function()
def func2(a1, a2):
    a1 += a2

@b.add_arguments('array size')
def argument_provider():
    # Square arrays whose side length grows geometrically, so the total
    # element count (dim_size ** 2) spans several orders of magnitude.
    for exp in range(3, 28):
        dim_size = int(1.4**exp)
        a1 = np.random.random([dim_size, dim_size])
        a2 = np.random.random([dim_size, dim_size])
        yield dim_size ** 2, MultiArgument([a1, a2])

r = b.run()
r.plot()

