Performance of different vectorization methods in NumPy


Question

I wanted to test the performance of vectorized code in Python:

import timeit
import numpy as np

def func1():
  x = np.arange(1000)
  sum = np.sum(x*2)
  return sum

def func2():
  sum = 0
  for i in xrange(1000):
    sum += i*2
  return sum

def func3():
  sum = 0
  for i in xrange(0,1000,4):
    x = np.arange(i,i+4,1)
    sum += np.sum(x*2)
  return sum

print timeit.timeit(func1, number = 1000)
print timeit.timeit(func2, number = 1000)
print timeit.timeit(func3, number = 1000)

The code gives the following output:

0.0105729103088
0.069864988327
0.983253955841

The performance difference between the first and second functions is not surprising. But I was surprised that the third function is significantly slower than the other two.

I am much more familiar with vectorizing code in C than in Python, and the third function is more C-like: it runs a for loop and processes 4 numbers in one instruction per iteration. To my understanding, NumPy calls a C function and then vectorizes the code in C. If that is the case, my code is also passing 4 numbers to NumPy at a time, so the code shouldn't perform better when I pass more numbers at once. So why is it so much slower? Is it because of the overhead of calling a NumPy function?

Besides, the reason I came up with the third function in the first place is that I was worried about the cost of allocating a large amount of memory for x in func1.

Is my worry valid? If so, why, and how can I improve the code? If not, why not?

Thanks in advance.

For curiosity's sake, although it defeats my original purpose for creating the third version, I looked into roganjosh's suggestion and tried the following edit.

def func3():
  sum = 0
  x = np.arange(0,1000)
  for i in xrange(0,1000,4):
    sum += np.sum(x[i:i+4]*2)
  return sum

Output:

0.0104308128357
0.0630609989166
0.748773813248

There is an improvement, but still a large gap compared with the other functions.

Is it because x[i:i+4] still creates a new array?

I've modified the code again according to Daniel's suggestion.

def func1():
  x = np.arange(1000)
  x *= 2
  return x.sum()

def func3():
  sum = 0
  x = np.arange(0,1000)
  for i in xrange(0,1000,4):
    x[i:i+4] *= 2
    sum += x[i:i+4].sum()
  return sum

Output:

0.00824999809265
0.0660569667816
0.598328828812

There is another speedup, so allocating NumPy arrays is definitely part of the problem. Now func3 should contain only one array allocation, yet it is still far slower. Is it because of the overhead of calling into NumPy?

Answer

It seems you're mostly interested in how your function 3 compares with the pure NumPy (function 1) and pure Python (function 2) approaches. The answer is quite simple (especially if you look at function 4):

  • NumPy functions have a "huge" constant factor.

You typically need several thousand elements to reach the regime where the runtime of np.sum actually depends on the number of elements in the array. Using IPython and matplotlib (the plotting code is at the end of the answer) you can easily check how the runtime depends on the array size:

import numpy as np

n = []
timing_sum1 = []
timing_sum2 = []
for i in range(1, 25):
    num = 2**i
    arr = np.arange(num)
    print(num)
    time1 = %timeit -o arr.sum()    # calling the method
    time2 = %timeit -o np.sum(arr)  # calling the function
    n.append(num)
    timing_sum1.append(time1)
    timing_sum2.append(time2)

The results for np.sum (shortened) are quite interesting:

4
22.6 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16
25.1 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
64
25.3 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
256
24.1 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1024
24.6 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
4096
27.6 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16384
40.6 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
65536
91.2 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
262144
394 µs ± 8.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1048576
1.24 ms ± 4.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4194304
4.71 ms ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16777216
18.6 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The constant factor seems to be roughly 20 µs on my computer, and it takes an array with about 16,384 elements to double that time. So the timings for functions 3 and 4 are mostly multiples of the constant factor.
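As a rough way to reproduce this observation outside IPython, the sketch below times np.sum on a tiny and a large array with the standard timeit module. The helper name per_call_seconds and the chosen sizes and repeat counts are illustrative assumptions, and the absolute numbers will differ from machine to machine.

```python
import timeit
import numpy as np

def per_call_seconds(size, repeats=5, number=500):
    """Best-of-`repeats` time for one np.sum call on `size` elements."""
    arr = np.arange(size)
    best = min(timeit.repeat(lambda: np.sum(arr), repeat=repeats, number=number))
    return best / number

small = per_call_seconds(4)          # dominated by the fixed call overhead
large = per_call_seconds(100_000)    # the element count starts to matter

print("per call:    4 elements %.2e s, 100000 elements %.2e s" % (small, large))
print("per element: 4 elements %.2e s, 100000 elements %.2e s"
      % (small / 4, large / 100_000))
```

If the constant factor dominates for small inputs, the per-call times should be of a similar order of magnitude, while the per-element cost of the 4-element array should be far higher.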

In function 3 you pay the constant factor twice per iteration, once for np.sum and once for np.arange. In this case arange is quite cheap because each array has the same size, so NumPy, Python, and your OS probably reuse the memory of the array from the previous iteration. However, even that takes time (roughly 2 µs for very small arrays on my computer).

More generally: to identify bottlenecks you should always profile the functions!
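If line-profiler is not installed, the standard library's cProfile gives a coarser, per-function view of the same thing. The following is only a sketch: chunked_sum is a hypothetical stand-in for the slicing variant of the loop, and the report layout depends on the Python version.

```python
import cProfile
import io
import pstats
import numpy as np

def chunked_sum():
    """Sum 2*i for i in 0..999, four elements per NumPy call."""
    total = 0
    x = np.arange(1000)
    for i in range(0, 1000, 4):
        total += np.sum(x[i:i + 4] * 2)
    return total

profiler = cProfile.Profile()
profiler.enable()
result = chunked_sum()
profiler.disable()

# Collect the report in a string and show the most expensive entries first.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(8)
report = buffer.getvalue()
print(report)
```

The call count next to the NumPy summation entries (250 here) is what to look for: many calls, each paying the fixed overhead.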

I show the results for the functions with line-profiler. To do that, I altered the functions a bit so that they do only one operation per line:

import numpy as np

def func1():
    x = np.arange(1000)
    x = x*2
    return np.sum(x)

def func2():
    sum_ = 0
    for i in range(1000):
        tmp = i*2
        sum_ += tmp
    return sum_

def func3():
    sum_ = 0
    for i in range(0, 1000, 4):  # I'm using python3, so "range" is like "xrange"!
        x = np.arange(i, i + 4, 1)
        x = x * 2
        tmp = np.sum(x)
        sum_ += tmp
    return sum_

def func4():
    sum_ = 0
    x = np.arange(1000)
    for i in range(0, 1000, 4):
        y = x[i:i + 4]
        y = y * 2
        tmp = np.sum(y)
        sum_ += tmp
    return sum_

Results:

%load_ext line_profiler

%lprun -f func1 func1()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           def func1():
     5         1           62     62.0     23.8      x = np.arange(1000)
     6         1           65     65.0     24.9      x = x*2
     7         1          134    134.0     51.3      return np.sum(x)

%lprun -f func2 func2()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           def func2():
    10         1            7      7.0      0.1      sum_ = 0
    11      1001         2523      2.5     30.9      for i in range(1000):
    12      1000         2819      2.8     34.5          tmp = i*2
    13      1000         2819      2.8     34.5          sum_ += tmp
    14         1            3      3.0      0.0      return sum_

%lprun -f func3 func3()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    16                                           def func3():
    17         1            7      7.0      0.0      sum_ = 0
    18       251          909      3.6      2.9      for i in range(0, 1000, 4):
    19       250         6527     26.1     21.2          x = np.arange(i, i + 4, 1)
    20       250         5615     22.5     18.2          x = x * 2
    21       250        16053     64.2     52.1          tmp = np.sum(x)
    22       250         1720      6.9      5.6          sum_ += tmp
    23         1            3      3.0      0.0      return sum_

%lprun -f func4 func4()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           def func4():
    26         1            7      7.0      0.0      sum_ = 0
    27         1           49     49.0      0.2      x = np.arange(1000)
    28       251          892      3.6      3.4      for i in range(0, 1000, 4):
    29       250         2177      8.7      8.3          y = x[i:i + 4]
    30       250         5431     21.7     20.7          y = y * 2
    31       250        15990     64.0     60.9          tmp = np.sum(y)
    32       250         1686      6.7      6.4          sum_ += tmp
    33         1            3      3.0      0.0      return sum_

I won't go into the details of the results, but as you can see np.sum is definitely the bottleneck in func3 and func4. I had already guessed that np.sum was the bottleneck before I wrote the answer, but these line profiles actually verify it.
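On the question of whether x[i:i+4] still creates a new array: basic slicing returns a view, which is a new (small) array object that shares the original data, whereas the multiplication allocates a fresh result array. A minimal sketch to check this, with variable names chosen only for illustration:

```python
import numpy as np

x = np.arange(1000)

view = x[0:4]        # basic slicing: new array object, no data copied
doubled = view * 2   # arithmetic: allocates a new array for the result

print(np.shares_memory(x, view))     # the slice shares x's buffer
print(np.shares_memory(x, doubled))  # the product has its own buffer

view[0] = -1         # writing through the view also changes x
print(x[0])
```

So the slice avoids copying data, but each iteration still creates Python-level array objects and pays the per-call overhead of the multiplication and the sum.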

This leads to some very important points when using NumPy:

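  • Know when to use it! Small arrays are (mostly) not worth it.
  • Know the NumPy functions and just use them. They already use compiler optimization flags (where available) to unroll loops.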

If you really believe some part is too slow, then you can use:

  • NumPy's C API, processing the array in C (this can be really easy with Cython, but you can also do it manually)
  • Numba (based on LLVM).

But generally you probably can't beat NumPy for moderately sized arrays (several thousand entries and more).
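If the memory used by one large array is a real concern, one possible compromise is to keep the chunked loop but make each chunk large enough that the per-call overhead is amortized. The sketch below is an illustration rather than part of the original answer; the function name doubled_sum_chunked and the default chunk_size of 65,536 are assumptions.

```python
import numpy as np

def doubled_sum_chunked(stop, chunk_size=65_536):
    """Sum 2*i for i in range(stop) without materializing the whole range."""
    total = 0
    for start in range(0, stop, chunk_size):
        end = min(start + chunk_size, stop)
        chunk = np.arange(start, end, dtype=np.int64)  # at most chunk_size elements
        total += int(np.sum(chunk * 2))                # one NumPy call per chunk
    return total

print(doubled_sum_chunked(1_000_000))
```

With chunks of tens of thousands of elements, only a handful of NumPy calls are made, so the constant factor is paid a few times rather than once for every 4 elements.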

%matplotlib notebook

import matplotlib.pyplot as plt

# Average time per sum-call
fig = plt.figure(1)
ax = plt.subplot(111)
ax.plot(n, [time.average for time in timing_sum1], label='arr.sum()', c='red')
ax.plot(n, [time.average for time in timing_sum2], label='np.sum(arr)', c='blue')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('elements')
ax.set_ylabel('time it takes to sum them [seconds]')
ax.grid(which='both')
ax.legend()

# Average time per element
fig = plt.figure(2)  # separate figure so the second plot does not draw over the first
ax = plt.subplot(111)
ax.plot(n, [time.average / num for num, time in zip(n, timing_sum1)], label='arr.sum()', c='red')
ax.plot(n, [time.average / num for num, time in zip(n, timing_sum2)], label='np.sum(arr)', c='blue')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('elements')
ax.set_ylabel('time per element [seconds / element]')
ax.grid(which='both')
ax.legend()

The plots are log-log. I think that is the best way to visualize the data, given that it spans several orders of magnitude (I just hope it's still understandable).

The first plot shows how much time it takes to do the sum.

The second plot shows the average time it takes to do the sum divided by the number of elements in the array. This is just another way to interpret the data.
