numpy ufuncs speed vs for loop speed
Question
I've read a lot of "avoid for loops with numpy" advice. So, I tried. I was using this code (simplified version). Some auxiliary data:
In[1]: import numpy as np
resolution = 1000 # this parameter varies
tim = np.linspace(-np.pi, np.pi, resolution)
prec = np.arange(1, resolution + 1)
prec = 2 * prec - 1
values = np.zeros_like(tim)
My first implementation was with a for loop:
In[2]: for i, ti in enumerate(tim):
           values[i] = np.sum(np.sin(prec * ti))
Then I got rid of the explicit for loop and achieved this:
In[3]: values = np.sum(np.sin(tim[:, np.newaxis] * prec), axis=1)
And this solution was faster for small arrays, but when I scaled up I got this time dependence:
Am I missing something, or is this normal behavior? And if it's not, where should I dig?
EDIT: According to the comments, here is some additional information. The times were measured with IPython's %timeit and %%timeit, and every run was performed on a fresh kernel. My laptop is an Acer Aspire V7-482PG (i7, 8 GB). I'm using:
- python 3.5.2
- numpy 1.11.2 + mkl
- Windows 10
Answer
This is normal and expected behaviour. The "avoid for loops with numpy" statement is just too simplified to apply everywhere. If you're dealing with inner loops, it's (almost) always true. But in the case of outer loops (like in your case) there are far more exceptions. Especially if the alternative is to use broadcasting, because broadcasting speeds up your operation by using a lot more memory.
Just to add a bit of background to that "avoid for loops with numpy" statement:
NumPy arrays are stored as contiguous arrays of C types. A Python int is not the same as a C int! So whenever you iterate over each item in an array you need to pull the item from the array, convert it to a Python int, do whatever you want with it, and finally you may need to convert it back to a C integer (this is called boxing and unboxing the value). For example, say you want to sum the items in an array using Python:
import numpy as np
arr = np.arange(1000)

%%timeit
acc = 0
for item in arr:
    acc += item
# 1000 loops, best of 3: 478 µs per loop
You are better off using numpy:
%timeit np.sum(arr)
# 10000 loops, best of 3: 24.2 µs per loop
Even if you push the loop into C code by using Python's builtin sum, you're still far away from numpy's performance:
%timeit sum(arr)
# 1000 loops, best of 3: 387 µs per loop
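The boxing described above is easy to observe directly: iterating over a numpy array yields boxed numpy scalar objects, not bare C values. A minimal sketch:

```python
import numpy as np

arr = np.arange(1000)
item = next(iter(arr))
# Each iteration produces a boxed numpy scalar object (e.g. np.int64),
# not a bare C integer -- this boxing/unboxing is part of the loop cost.
boxed = isinstance(item, np.integer)
```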
There might be exceptions to this rule, but they will be really rare. At least as long as there is equivalent numpy functionality. So whenever you would otherwise iterate over single elements, use numpy.
Sometimes, though, a plain python loop is enough. It's not widely advertised, but numpy functions have a huge overhead compared to Python functions. For example, consider a 3-element array:
arr = np.arange(3)
%timeit np.sum(arr)
%timeit sum(arr)
Which one will be faster?
Solution: The Python function performs better than the numpy solution:
# 10000 loops, best of 3: 21.9 µs per loop <- numpy
# 100000 loops, best of 3: 6.27 µs per loop <- python
But what does this have to do with your example? Not all that much, in fact, because you always call numpy functions on arrays (not on single elements, and not even on a few elements), so your inner loop already uses the optimized functions. That's why both perform roughly the same (+/- a factor of 10 with very few elements, down to a factor of 2 at roughly 500 elements). But that's not really loop overhead, it's function call overhead!
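To find roughly where the break-even point lies on your own machine, a small timing sketch using the stdlib timeit instead of %timeit (the helper name compare is my own, not from the question):

```python
import timeit
import numpy as np

def compare(resolution, repeats=3):
    """Return (loop_seconds, broadcast_seconds) for one full evaluation each.
    A rough sketch, not a substitute for %timeit."""
    tim = np.linspace(-np.pi, np.pi, resolution)
    prec = 2 * np.arange(1, resolution + 1) - 1
    values = np.zeros_like(tim)

    def loop():
        for i, ti in enumerate(tim):
            values[i] = np.sum(np.sin(prec * ti))

    def broadcast():
        return np.sin(tim[:, np.newaxis] * prec).sum(axis=1)

    # Take the best of a few repeats to reduce timer noise.
    return (min(timeit.repeat(loop, number=1, repeat=repeats)),
            min(timeit.repeat(broadcast, number=1, repeat=repeats)))
```

Calling compare at several resolutions shows the crossover moving as the function call overhead is amortized.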
Using line-profiler with resolution = 100:
def fun_func(tim, prec, values):
    for i, ti in enumerate(tim):
        values[i] = np.sum(np.sin(prec * ti))

%lprun -f fun_func fun_func(tim, prec, values)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 101 752 7.4 5.7 for i, ti in enumerate(tim):
3 100 12449 124.5 94.3 values[i] = np.sum(np.sin(prec * ti))
95% of the time is spent inside the loop body; I even split the loop body into several parts to verify this:
def fun_func(tim, prec, values):
    for i, ti in enumerate(tim):
        x = prec * ti
        x = np.sin(x)
        x = np.sum(x)
        values[i] = x

%lprun -f fun_func fun_func(tim, prec, values)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 101 609 6.0 3.5 for i, ti in enumerate(tim):
3 100 4521 45.2 26.3 x = prec * ti
4 100 4646 46.5 27.0 x = np.sin(x)
5 100 6731 67.3 39.1 x = np.sum(x)
6 100 714 7.1 4.1 values[i] = x
The time consumers here are np.multiply, np.sin and np.sum, as you can easily check by comparing their time per call with their overhead:
arr = np.ones(1, float)
%timeit np.sum(arr)
# 10000 loops, best of 3: 22.6 µs per loop
So as soon as the cumulative function call overhead is small compared to the calculation runtime, you'll have similar runtimes. Even with 100 items you're quite close to the overhead time. The trick is to know at which point they break even. With 1000 items the call overhead is still significant:
%lprun -f fun_func fun_func(tim, prec, values)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 1001 5864 5.9 2.4 for i, ti in enumerate(tim):
3 1000 42817 42.8 17.2 x = prec * ti
4 1000 119327 119.3 48.0 x = np.sin(x)
5 1000 73313 73.3 29.5 x = np.sum(x)
6 1000 7287 7.3 2.9 values[i] = x
But with resolution = 5000
the overhead is quite low compared to the runtime:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 5001 29412 5.9 0.9 for i, ti in enumerate(tim):
3 5000 388827 77.8 11.6 x = prec * ti
4 5000 2442460 488.5 73.2 x = np.sin(x)
5 5000 441337 88.3 13.2 x = np.sum(x)
6 5000 36187 7.2 1.1 values[i] = x
When you spend 500 µs in each np.sin call, you don't care about the 20 µs of overhead anymore.
A word of caution is probably in order: line_profiler probably adds some additional overhead per line, and maybe also per function call, so the point at which the function call overhead becomes negligible might be lower!
So far I have profiled the first solution; let's do the same with the second solution:
def fun_func(tim, prec, values):
    x = tim[:, np.newaxis]
    x = x * prec
    x = np.sin(x)
    x = np.sum(x, axis=1)
    return x
Again using line_profiler, with resolution = 100:
%lprun -f fun_func fun_func(tim, prec, values)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 1 27 27.0 0.5 x = tim[:, np.newaxis]
3 1 638 638.0 12.9 x = x * prec
4 1 3963 3963.0 79.9 x = np.sin(x)
5 1 326 326.0 6.6 x = np.sum(x, axis=1)
6 1 4 4.0 0.1 return x
This already exceeds the overhead time significantly, and thus we end up a factor of 10 faster than the loop.
I also profiled with resolution = 1000:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 1 28 28.0 0.0 x = tim[:, np.newaxis]
3 1 17716 17716.0 14.6 x = x * prec
4 1 91174 91174.0 75.3 x = np.sin(x)
5 1 12140 12140.0 10.0 x = np.sum(x, axis=1)
6 1 10 10.0 0.0 return x
And with resolution = 5000:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def fun_func(tim, prec, values):
2 1 34 34.0 0.0 x = tim[:, np.newaxis]
3 1 333685 333685.0 11.1 x = x * prec
4 1 2391812 2391812.0 79.6 x = np.sin(x)
5 1 280832 280832.0 9.3 x = np.sum(x, axis=1)
6 1 14 14.0 0.0 return x
The 1000 size is still faster, but as we've seen, the call overhead was still non-negligible in the loop solution. At resolution = 5000, however, the time spent in each step is almost identical (some steps are a bit slower, others faster, but overall quite similar).
Another effect is that the actual broadcasting in the multiplication becomes significant. Even with the very smart numpy solutions this still includes some additional calculations. At resolution = 10000 you can see that the broadcast multiplication starts to take up more "% time" relative to the loop solution:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def broadcast_solution(tim, prec, values):
2 1 37 37.0 0.0 x = tim[:, np.newaxis]
3 1 1783345 1783345.0 13.9 x = x * prec
4 1 9879333 9879333.0 77.1 x = np.sin(x)
5 1 1153789 1153789.0 9.0 x = np.sum(x, axis=1)
6 1 11 11.0 0.0 return x
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 def loop_solution(tim, prec, values):
9 10001 62502 6.2 0.5 for i, ti in enumerate(tim):
10 10000 1287698 128.8 10.5 x = prec * ti
11 10000 9758633 975.9 79.7 x = np.sin(x)
12 10000 1058995 105.9 8.6 x = np.sum(x)
13 10000 75760 7.6 0.6 values[i] = x
But there is another thing besides the actual time spent: memory consumption. Your loop solution requires O(n) memory because you only ever process n elements at a time. The broadcasting solution, however, requires O(n*n) memory. You would probably have to wait some time if you used resolution = 20000 with your loop, but it would still only require 8 bytes/element * 20000 elements ≈ 160 kB, whereas the broadcasting needs ~3 GB. And this neglects constant factors (like temporary, a.k.a. intermediate, arrays)! Go even further and you'll run out of memory very fast!
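The ~160 kB vs ~3 GB figures follow directly from the element counts; a back-of-the-envelope sketch:

```python
n = 20000
bytes_per_float = 8  # float64

loop_bytes = bytes_per_float * n           # one length-n temporary at a time
broadcast_bytes = bytes_per_float * n * n  # full (n, n) intermediate array

print(loop_bytes)       # 160000 bytes ~ 160 kB
print(broadcast_bytes)  # 3200000000 bytes ~ 3.2 GB
```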
Time to summarize the points again:
- If you run a Python loop over single items of a numpy array, you're doing it wrong.
- If you loop over sub-arrays of a numpy array, make sure the function call overhead in each iteration is negligible compared to the time spent inside the functions.
- If you broadcast numpy arrays, make sure you don't run out of memory.
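One middle ground between the last two points (my own sketch, not part of the original answer): broadcast over fixed-size chunks of the outer dimension, so the per-call overhead stays negligible while memory stays O(chunk * n) instead of O(n * n). The name chunked_solution and the default chunk size are assumptions:

```python
import numpy as np

def chunked_solution(tim, prec, values, chunk=256):
    # Broadcast inside each fixed-size block of the outer dimension:
    # few function calls per block, but only a (chunk, n) intermediate.
    for start in range(0, tim.size, chunk):
        block = tim[start:start + chunk, np.newaxis]           # (chunk, 1)
        values[start:start + chunk] = np.sin(block * prec).sum(axis=1)
    return values
```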
But the most important point about optimization is still:
- Only optimize the code if it's too slow! And if it is too slow, optimize only after profiling your code.
Don't blindly trust simplified statements and again never optimize without profiling.
One last thought:
Functions that require either a loop or broadcasting can easily be implemented with cython, numba or numexpr if there is no existing solution in numpy or scipy.
For example, a numba function that combines the memory efficiency of the loop solution with the speed of the broadcast solution at low resolutions would look like this:
from numba import njit
import math

@njit
def numba_solution(tim, prec, values):
    size = tim.size
    for i in range(size):
        ti = tim[i]
        x = 0
        for j in range(size):
            x += math.sin(prec[j] * ti)
        values[i] = x
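If numba isn't installed, the same kernel can be run (slowly) as plain Python, which is handy for verifying correctness against the broadcast result; a sketch with a small setup of my own:

```python
import math
import numpy as np

def loop_kernel(tim, prec, values):
    # The same double loop the @njit version compiles, in plain Python.
    for i in range(tim.size):
        ti = tim[i]
        x = 0.0
        for j in range(prec.size):
            x += math.sin(prec[j] * ti)
        values[i] = x
    return values

resolution = 50
tim = np.linspace(-np.pi, np.pi, resolution)
prec = 2 * np.arange(1, resolution + 1) - 1
values = loop_kernel(tim, prec, np.zeros_like(tim))
```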
As pointed out in the comments, numexpr can also evaluate the broadcast calculation very quickly, and without requiring O(n*n) memory:
>>> import numexpr
>>> tim_2d = tim[:, np.newaxis]
>>> numexpr.evaluate('sum(sin(tim_2d * prec), axis=1)')