为什么numpy比循环慢 [英] Why is numpy slower than for loop

查看:230
本文介绍了为什么numpy比循环慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个函数使用一些循环,我想提高使用numpy的速度。但是,这看起来不是伎俩,因为颠簸版本似乎慢了2倍。这里是代码:

pre $ import numpy as np
import itertools
import timeit

def func():
sample = np.random.random_sample((100,2))

disc1 = 0
disc2 = 0
n_sample = len(样本)
dim = sample.shape [1]

在范围内(n_sample):
prod = 1
在范围内(dim) :
sub = np.abs(sample [i,k] - 0.5)
prod * = 1 + 0.5 * sub - 0.5 * sub ** 2

disc1 + = prod

为itertools.product(范围(n_sample),范围(n_sample))中的i,j:
prod = 1
对于范围(dim)中的k:
a = 0.5 * np.abs(sample [i,k] - 0.5)
b = 0.5 * np.abs(sample [j,k] - 0.5)
c = 0.5 * np.abs (i,k) - 样本[j,k])
prod * = 1 + a + b - c
disc2 + = prod

c2 =(13/12) ** dim-2 / n_sample * disc1 + 1 /(n_sample ** 2)* disc2


def func_numpy():
sample = np.random.random_sample((100,2))

disc1 = 0
disc2 = 0
n_sample = len (样本)
dim = sample.shape [1]

disc1 = np.sum(np.prod(1 + 0.5 * np.abs(sample - 0.5) - 0.5 * np。 abs(sample-0.5)** 2,axis = 1))

for itertools.product(range(n_sample),range(n_sample))中的i,j:
disc2 + = (样本[i] - 0.5)+ 0.5 * np.abs(样本[j] -0.5)-0.5 * np.abs(样本[i] - 样本[j]) )

c2 =(13/12)** dim-2 / n_sample * disc1 + 1 /(n_sample ** 2)* disc2


print( timeit.repeat('func()',number = 20,repeat = 5,setup =from __main__ import func))
print('numpy function time:',timeit'重复('func_numpy()',number = 20,repeat = 5,setup =from __main__ import func_numpy))

计时输出是:

 正常功能时间:[2.831496894999873, 2.832342429959681,2.8009242500411347,2.8075121529982425,2.824807019031141] 
numpy函数时间:[5.154757721000351,5,2011515340418555,5.148996959964279,5.095560318033677,5.125199959962629]

我在这里错过了什么?我知道瓶颈是itertools部分,因为我有一个100x100x2循环,而不是100x2循环。
您是否看到了另一种方式呢?

解决方案

使用NumPy,我们必须寻找矢量化的东西,这里肯定是这样做的。仔细观察循环部分,我们沿着输入数据 samples 的第一个轴进行迭代循环启动:

pre $ for itertools.product(range(n_sample),range(n_sample))中的i,j:

我们可以将这些迭代转化为向量化的操作,一旦我们让 broadcast 处理这些。

现在,要有一个完全向量化的解决方案,我们需要更多的内存空间,特别是(N,N,M),其中(N,M)是输入数据的形状。这里是在每次迭代中,我们没有做很多工作,因为我们在每一行执行操作,每行只包含给定样本的 2 元素。所以,出来的想法是我们可以沿着 M 运行一个循环,这样在每次迭代时,我们将计算 prod 并累积。因此,对于给定的示例,它只是两个循环迭代。



走出循环,我们会累积 prod ,只需要将 disc2 作为最终输出的总和。



下面是一个实现上面的实现上面提到的想法 -

  prod_arr = 1 
在范围内(sample.shape [1]):
si = sample [:,i]
prod_arr * = 1 + 0.5 * np.abs(si [:,None] - 0.5)+ 0.5 * np.abs(si - 0.5) - \
0.5 * np.abs(si [:,None] - si)
disc2 = prod_arr.sum()

运行时测试

从原始方法和修改版本中删除了loopy部分如下:

pre $ def $ or $ b $ sample $ $ $ $ $ $ $ $ $ $ $ $ $ $ $
for itertools.product中的i,j(范围(n_sample),范围(n_sample)):
dis (sample [j] - 0.5) - 0.5 * np.abs(sample [i] - 0.5)+ 0.5 * (i) - 样本[j]))
return disc2


def mod_app(样本):
prod_arr = 1
sample.shape [1]):
si = sample [:,i]
prod_arr * = 1 + 0.5 * np.abs(si [:,None] - 0.5)+ 0.5 * np.abs (si - 0.5) - \
0.5 * np.abs(si [:,None] - si)
disc2 = prod_arr.sum()
return disc2

定时和核实 -

  
在[11]中:org_app(sample)
Out [11]:11934.878683659041

In [12]:mod_app(sample)
Out [12]:11934.878683659068

In [14]:%timeit org_app(sample)
10个循环,最好3:每循环84.4毫秒

在[15]:%timeit mod_app(样本)
10000循环,最好是3:每循环94.6μs

关于 900x 加速!那么这应该是足够的激励,希望尽可能矢量化的东西。


I have a function using some for loops and I wanted to improve the speed using numpy. But this seems not to do the trick as the bumpy version appears to be 2 times slower. Here is the code:

import numpy as np
import itertools
import timeit

def func():
    sample = np.random.random_sample((100, 2))

    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    for i in range(n_sample):
        prod = 1
        for k in range(dim):
            sub = np.abs(sample[i, k] - 0.5)
            prod *= 1 + 0.5 * sub - 0.5 * sub ** 2

        disc1 += prod

    for i, j in itertools.product(range(n_sample), range(n_sample)):
        prod = 1
        for k in range(dim):
            a = 0.5 * np.abs(sample[i, k] - 0.5)
            b = 0.5 * np.abs(sample[j, k] - 0.5)
            c = 0.5 * np.abs(sample[i, k] - sample[j, k])
            prod *= 1 + a + b - c
        disc2 += prod

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2


def func_numpy():
    sample = np.random.random_sample((100, 2))

    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]

    disc1 = np.sum(np.prod(1 + 0.5 * np.abs(sample - 0.5) - 0.5 * np.abs(sample - 0.5) ** 2, axis=1))

    for i, j in itertools.product(range(n_sample), range(n_sample)):
        disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))

    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2


print('Normal function time: ' , timeit.repeat('func()', number=20, repeat=5, setup="from __main__ import func"))
print('numpy function time: ', timeit.repeat('func_numpy()', number=20, repeat=5, setup="from __main__ import func_numpy"))

The timing output is:

Normal function time:  [2.831496894999873, 2.832342429959681, 2.8009242500411347, 2.8075121529982425, 2.824807019031141]
numpy function time:  [5.154757721000351, 5.2011515340418555, 5.148996959964279, 5.095560318033677, 5.125199959962629]

What am I missing here? I know that the bottleneck is the itertools part because I have a 100x100x2 loop instead of a 100x2 loop before. Do you see another way to do that?

解决方案

With NumPy, one must look to vectorize things and we could certainly do so here.

Taking a closer look at the loop portion, we are iterating along the first axis of the input data samples twice with that loop startup :

for i, j in itertools.product(range(n_sample), range(n_sample)):

We could convert these iterations into vectorized operations, once we let broadcasting handle those.

Now, to have a fully vectorized solution, we would need a lot more memory space, specifically (N,N,M), where (N,M) is the shape of the input data.

Another noticeable aspect here is that at each iteration, we aren't doing a lot of work, as we are performing operation on each row and each row contains only 2 elements for the given sample. So, the idea that comes out is that we could run a loop along M, such that at each iteration, we would compute the prod and accumulate. Thus, for the given sample, its just two loop iterations.

Getting out of the loop, we would have the accumulated prod, that just needs summation for disc2 as the final output.

Here's an implementation to fulfil the above mentioned ideas -

prod_arr = 1
for i in range(sample.shape[1]):
    si = sample[:,i]
    prod_arr *= 1 + 0.5 * np.abs(si[:,None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
                                    0.5 * np.abs(si[:,None] - si)
disc2 = prod_arr.sum()

Runtime test

The stripped down version of the loopy portion from the original approach and the modified versions as approaches are listed below :

def org_app(sample):
    disc2 = 0
    n_sample = len(sample)
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5) + 0.5 * \
            np.abs(sample[j] - 0.5) - 0.5 * np.abs(sample[i] - sample[j]))
    return disc2


def mod_app(sample):
    prod_arr = 1
    for i in range(sample.shape[1]):
        si = sample[:,i]
        prod_arr *= 1 + 0.5 * np.abs(si[:,None] - 0.5) + 0.5 * np.abs(si - 0.5) - \
                                        0.5 * np.abs(si[:,None] - si)
    disc2 = prod_arr.sum()
    return disc2

Timings and verification -

In [10]: sample = np.random.random_sample((100, 2))

In [11]: org_app(sample)
Out[11]: 11934.878683659041

In [12]: mod_app(sample)
Out[12]: 11934.878683659068

In [14]: %timeit org_app(sample)
10 loops, best of 3: 84.4 ms per loop

In [15]: %timeit mod_app(sample)
10000 loops, best of 3: 94.6 µs per loop

About 900x speedup! Well this should be motivating enough, hopefully to look to vectorize things whenever possible.

这篇关于为什么numpy比循环慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆