为什么numpy比循环慢 [英] Why is numpy slower than for loop
问题描述
我有一个函数使用一些循环,我想提高使用numpy的速度。但是,这看起来不是伎俩,因为颠簸版本似乎慢了2倍。这里是代码:
pre $ import numpy as np
import itertools
import timeit
def func():
sample = np.random.random_sample((100,2))
disc1 = 0
disc2 = 0
n_sample = len(样本)
dim = sample.shape [1]
在范围内(n_sample):
prod = 1
在范围内(dim) :
sub = np.abs(sample [i,k] - 0.5)
prod * = 1 + 0.5 * sub - 0.5 * sub ** 2
disc1 + = prod
为itertools.product(范围(n_sample),范围(n_sample))中的i,j:
prod = 1
对于范围(dim)中的k:
a = 0.5 * np.abs(sample [i,k] - 0.5)
b = 0.5 * np.abs(sample [j,k] - 0.5)
c = 0.5 * np.abs (i,k) - 样本[j,k])
prod * = 1 + a + b - c
disc2 + = prod
c2 =(13/12) ** dim-2 / n_sample * disc1 + 1 /(n_sample ** 2)* disc2
def func_numpy():
sample = np.random.random_sample((100,2))
disc1 = 0
disc2 = 0
n_sample = len (样本)
dim = sample.shape [1]
disc1 = np.sum(np.prod(1 + 0.5 * np.abs(sample - 0.5) - 0.5 * np。 abs(sample-0.5)** 2,axis = 1))
for itertools.product(range(n_sample),range(n_sample))中的i,j:
disc2 + = (样本[i] - 0.5)+ 0.5 * np.abs(样本[j] -0.5)-0.5 * np.abs(样本[i] - 样本[j]) )
c2 =(13/12)** dim-2 / n_sample * disc1 + 1 /(n_sample ** 2)* disc2
print( timeit.repeat('func()',number = 20,repeat = 5,setup =from __main__ import func))
print('numpy function time:',timeit'重复('func_numpy()',number = 20,repeat = 5,setup =from __main__ import func_numpy))
计时输出是:
正常功能时间:[2.831496894999873, 2.832342429959681,2.8009242500411347,2.8075121529982425,2.824807019031141]
numpy函数时间:[5.154757721000351,5,2011515340418555,5.148996959964279,5.095560318033677,5.125199959962629]
我在这里错过了什么?我知道瓶颈是itertools部分,因为我有一个100x100x2循环,而不是100x2循环。
您是否看到了另一种方式呢?
使用NumPy,我们必须寻找矢量化的东西,这里肯定是这样做的。仔细观察循环部分,我们沿着输入数据 samples
的第一个轴进行迭代循环启动:
pre $ for itertools.product(range(n_sample),range(n_sample))中的i,j:
我们可以将这些迭代转化为向量化的操作,一旦我们让 broadcast
处理这些。
现在,要有一个完全向量化的解决方案,我们需要更多的内存空间,特别是(N,N,M)
,其中(N,M)
是输入数据的形状。这里是在每次迭代中,我们没有做很多工作,因为我们在每一行执行操作,每行只包含给定样本的 2
元素。所以,出来的想法是我们可以沿着 M
运行一个循环,这样在每次迭代时,我们将计算 prod
并累积。因此,对于给定的示例,它只是两个循环迭代。
走出循环,我们会累积 prod
,只需要将 disc2
作为最终输出的总和。
下面是一个实现上面的实现上面提到的想法 -
prod_arr = 1
在范围内(sample.shape [1]):
si = sample [:,i]
prod_arr * = 1 + 0.5 * np.abs(si [:,None] - 0.5)+ 0.5 * np.abs(si - 0.5) - \
0.5 * np.abs(si [:,None] - si)
disc2 = prod_arr.sum()
运行时测试
从原始方法和修改版本中删除了loopy部分如下: 定时和核实 -
pre $ def $ or $ b $ sample $ $ $ $ $ $ $ $ $ $ $ $ $ $ $
for itertools.product中的i,j(范围(n_sample),范围(n_sample)):
dis (sample [j] - 0.5) - 0.5 * np.abs(sample [i] - 0.5)+ 0.5 * (i) - 样本[j]))
return disc2
def mod_app(样本):
prod_arr = 1
sample.shape [1]):
si = sample [:,i]
prod_arr * = 1 + 0.5 * np.abs(si [:,None] - 0.5)+ 0.5 * np.abs (si - 0.5) - \
0.5 * np.abs(si [:,None] - si)
disc2 = prod_arr.sum()
return disc2
在[11]中:org_app(sample)
Out [11]:11934.878683659041
In [12]:mod_app(sample)
Out [12]:11934.878683659068
In [14]:%timeit org_app(sample)
10个循环,最好3:每循环84.4毫秒
在[15]:%timeit mod_app(样本)
10000循环,最好是3:每循环94.6μs
关于 900x
加速!那么这应该是足够的激励,希望尽可能矢量化的东西。
I have a function using some for loops and I wanted to improve the speed using numpy. But this seems not to do the trick as the bumpy version appears to be 2 times slower. Here is the code:
import numpy as np
import itertools
import timeit
def func():
sample = np.random.random_sample((100, 2))
disc1 = 0
disc2 = 0
n_sample = len(sample)
dim = sample.shape[1]
for i in range(n_sample):
prod = 1
for k in range(dim):
sub = np.abs(sample[i, k] - 0.5)
prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
disc1 += prod
for i, j in itertools.product(range(n_sample), range(n_sample)):
prod = 1
for k in range(dim):
a = 0.5 * np.abs(sample[i, k] - 0.5)
b = 0.5 * np.abs(sample[j, k] - 0.5)
c = 0.5 * np.abs(sample[i, k] - sample[j, k])
prod *= 1 + a + b - c
disc2 += prod
c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2
def func_numpy():
sample = np.random.random_sample((100, 2))
disc1 = 0
disc2 = 0
n_sample = len(sample)
dim = sample.shape[1]
disc1 = np.sum(np.prod(1 + 0.5 * np.abs(sample - 0.5) - 0.5 * np.abs(sample - 0.5) ** 2, axis=1))
for i, j in itertools.product(range(n_sample), range(n_sample)):
disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))
c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2
print('Normal function time: ' , timeit.repeat('func()', number=20, repeat=5, setup="from __main__ import func"))
print('numpy function time: ', timeit.repeat('func_numpy()', number=20, repeat=5, setup="from __main__ import func_numpy"))
The timing output is:
Normal function time: [2.831496894999873, 2.832342429959681, 2.8009242500411347, 2.8075121529982425, 2.824807019031141]
numpy function time: [5.154757721000351, 5.2011515340418555, 5.148996959964279, 5.095560318033677, 5.125199959962629]
What am I missing here? I know that the bottleneck is the itertools part because I have a 100x100x2 loop instead of a 100x2 loop before. Do you see another way to do that?
With NumPy, one must look to vectorize things and we could certainly do so here.
Taking a closer look at the loop portion, we are iterating along the first axis of the input data samples
twice with that loop startup :
for i, j in itertools.product(range(n_sample), range(n_sample)):
We could convert these iterations into vectorized operations, once we let broadcasting
handle those.
Now, to have a fully vectorized solution, we would need a lot more memory space, specifically (N,N,M)
, where (N,M)
is the shape of the input data.
Another noticeable aspect here is that at each iteration, we aren't doing a lot of work, as we are performing operation on each row and each row contains only 2
elements for the given sample. So, the idea that comes out is that we could run a loop along M
, such that at each iteration, we would compute the prod
and accumulate. Thus, for the given sample, its just two loop iterations.
Getting out of the loop, we would have the accumulated prod
, that just needs summation for disc2
as the final output.
Here's an implementation to fulfil the above mentioned ideas -
prod_arr = 1
for i in range(sample.shape[1]):
si = sample[:,i]
prod_arr *= 1 + 0.5 * np.abs(si[:,None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
0.5 * np.abs(si[:,None] - si)
disc2 = prod_arr.sum()
Runtime test
The stripped down version of the loopy portion from the original approach and the modified versions as approaches are listed below :
def org_app(sample):
disc2 = 0
n_sample = len(sample)
for i, j in itertools.product(range(n_sample), range(n_sample)):
disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * \
np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))
return disc2
def mod_app(sample):
prod_arr = 1
for i in range(sample.shape[1]):
si = sample[:,i]
prod_arr *= 1 + 0.5 * np.abs(si[:,None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
0.5 * np.abs(si[:,None] - si)
disc2 = prod_arr.sum()
return disc2
Timings and verification -
In [10]: sample = np.random.random_sample((100, 2))
In [11]: org_app(sample)
Out[11]: 11934.878683659041
In [12]: mod_app(sample)
Out[12]: 11934.878683659068
In [14]: %timeit org_app(sample)
10 loops, best of 3: 84.4 ms per loop
In [15]: %timeit mod_app(sample)
10000 loops, best of 3: 94.6 µs per loop
About 900x
speedup! Well this should be motivating enough, hopefully to look to vectorize things whenever possible.
这篇关于为什么numpy比循环慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!