Difference in performance between numpy and matlab

Problem description

I am computing the backpropagation algorithm for a sparse autoencoder. I have implemented it in Python using numpy, and in Matlab. The code is almost the same, but the performance is very different: Matlab completes the task in 0.252454 seconds, while numpy takes 0.973672151566 seconds, almost four times as long. I will call this code several times later in a minimization problem, so this difference leads to several minutes of delay between the implementations. Is this normal behaviour? How could I improve the performance in numpy?

Numpy implementation:

sparse.rho is a tuning parameter, sparse.nodes is the number of nodes in the hidden layer (25), sparse.input (64) is the number of nodes in the input layer, theta1 and theta2 are the weight matrices for the first and second layers, with dimensions 25x64 and 64x25 respectively, m is equal to 10000, rhoest has dimension (25,), x has dimension 10000x64, a3 is 10000x64, and a2 is 10000x25.

UPDATE: I have introduced changes in the code following some of the ideas in the responses. The performance is now numpy: 0.65 vs. matlab: 0.25.

import time

import numpy as np

# Note: sum2 does not depend on i, so it is assumed to be computed once
# before this snippet (see the answer below).
partial_j1 = np.zeros(sparse.theta1.shape)
partial_j2 = np.zeros(sparse.theta2.shape)
partial_b1 = np.zeros(sparse.b1.shape)
partial_b2 = np.zeros(sparse.b2.shape)
t = time.time()

delta3t = (-(x - a3) * a3 * (1 - a3)).T

for i in range(m):
    delta3 = delta3t[:, i:(i + 1)]
    sum1 = np.dot(sparse.theta2.T, delta3)
    delta2 = (sum1 + sum2) * a2[i:(i + 1), :].T * (1 - a2[i:(i + 1), :].T)
    partial_j1 += np.dot(delta2, a1[i:(i + 1), :])
    partial_j2 += np.dot(delta3, a2[i:(i + 1), :])
    partial_b1 += delta2
    partial_b2 += delta3

print("Backprop time:", time.time() - t)

Matlab implementation:

tic
for i = 1:m

    delta3 = -(data(i,:)-a3(i,:)).*a3(i,:).*(1 - a3(i,:));
    delta3 = delta3.';
    sum1 =  W2.'*delta3;
    sum2 = beta*(-sparsityParam./rhoest + (1 - sparsityParam) ./ (1.0 - rhoest) );
    delta2 = ( sum1 + sum2 ) .* a2(i,:).' .* (1 - a2(i,:).');
    W1grad = W1grad + delta2* a1(i,:);
    W2grad = W2grad + delta3* a2(i,:);
    b1grad = b1grad + delta2;
    b2grad = b2grad + delta3;
end
toc

Answer

It would be wrong to say "Matlab is always faster than NumPy" or vice versa. Often their performance is comparable. When using NumPy, to get good performance you have to keep in mind that NumPy's speed comes from calling underlying functions written in C/C++/Fortran. It performs well when you apply those functions to whole arrays. In general, you get poorer performance when you call those NumPy functions on small arrays or scalars in a Python loop.
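
As a rough illustration (toy array size, not the arrays from the question), compare one whole-array call against the same arithmetic done element by element in Python:

import timeit

import numpy as np

a = np.random.random(10 ** 5)

# One call into compiled code for the whole array:
t_whole = timeit.timeit(lambda: a * 2.0, number=100)

# One Python-level multiplication per element (plus list building):
t_loop = timeit.timeit(lambda: [v * 2.0 for v in a], number=100)

print(t_whole, t_loop)  # expect t_loop to be far larger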

What's wrong with a Python loop, you ask? Every iteration through the Python loop is a call to a next method. Every use of [] indexing is a call to a __getitem__ method. Every += is a call to __iadd__. Every dotted attribute lookup (such as in np.dot) involves function calls. Those function calls add up to a significant hindrance to speed. These hooks give Python expressive power -- indexing means something different for strings than for dicts, for example. Same syntax, different meanings. The magic is accomplished by giving the objects different __getitem__ methods.
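
For a toy demonstration of these hooks (a made-up class, just to count the calls):

class Probe(object):
    """Count how many times Python calls the indexing hook."""
    def __init__(self):
        self.getitem_calls = 0

    def __getitem__(self, key):
        self.getitem_calls += 1
        return 0

p = Probe()
total = 0
for i in range(1000):
    total += p[i]          # every [] here is a __getitem__ call

print(p.getitem_calls)     # 1000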

But that expressive power comes at a cost in speed. So when you don't need all that dynamic expressivity, try to limit yourself to NumPy function calls on whole arrays to get better performance.

So, remove the for-loop; use "vectorized" equations when possible. For example, instead of

for i in range(m):
    delta3 = -(x[i,:]-a3[i,:])*a3[i,:]* (1 - a3[i,:])    

you can compute delta3 for each i all at once:

delta3 = -(x-a3)*a3*(1-a3)

for-loop中,delta3是向量,而使用矢量化方程式delta3是矩阵.

Whereas in the for-loop delta3 is a vector, with the vectorized equation delta3 is a matrix.
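
A quick shape check, using the dimensions from the question (m = 10000, 64 input nodes):

import numpy as np

m, n = 10000, 64
x = np.random.random((m, n))
a3 = np.random.random((m, n))

# One row per loop iteration:
print((-(x[0, :] - a3[0, :]) * a3[0, :] * (1 - a3[0, :])).shape)   # (64,)

# All rows at once:
print((-(x - a3) * a3 * (1 - a3)).shape)                           # (10000, 64)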

for-loop中的某些计算不依赖于i,因此应将其提升到循环之外.例如,sum2看起来像一个常量:

Some of the computations in the for-loop do not depend on i and therefore should be lifted outside the loop. For example, sum2 looks like a constant:

sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest) )


Here is a runnable example with an alternative implementation (alt) of your code (orig).

My timeit benchmark shows a 6.8x improvement in speed:

In [52]: %timeit orig()
1 loops, best of 3: 495 ms per loop

In [53]: %timeit alt()
10 loops, best of 3: 72.6 ms per loop


import numpy as np


class Bunch(object):
    """ http://code.activestate.com/recipes/52308 """
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

m, n, p = 10 ** 4, 64, 25

sparse = Bunch(
    theta1=np.random.random((p, n)),
    theta2=np.random.random((n, p)),
    b1=np.random.random((p, 1)),
    b2=np.random.random((n, 1)),
)

x = np.random.random((m, n))
a3 = np.random.random((m, n))
a2 = np.random.random((m, p))
a1 = np.random.random((m, n))
sum2 = np.random.random((p, ))
sum2 = sum2[:, np.newaxis]

def orig():
    partial_j1 = np.zeros(sparse.theta1.shape)
    partial_j2 = np.zeros(sparse.theta2.shape)
    partial_b1 = np.zeros(sparse.b1.shape)
    partial_b2 = np.zeros(sparse.b2.shape)
    delta3t = (-(x - a3) * a3 * (1 - a3)).T
    for i in range(m):
        delta3 = delta3t[:, i:(i + 1)]
        sum1 = np.dot(sparse.theta2.T, delta3)
        delta2 = (sum1 + sum2) * a2[i:(i + 1), :].T * (1 - a2[i:(i + 1), :].T)
        partial_j1 += np.dot(delta2, a1[i:(i + 1), :])
        partial_j2 += np.dot(delta3, a2[i:(i + 1), :])
        partial_b1 += delta2
        partial_b2 += delta3
        # delta3: (64, 1)
        # sum1: (25, 1)
        # delta2: (25, 1)
        # a1[i:(i+1),:]: (1, 64)
        # partial_j1: (25, 64)
        # partial_j2: (64, 25)
        # partial_b1: (25, 1)
        # partial_b2: (64, 1)
        # a2[i:(i+1),:]: (1, 25)
    return partial_j1, partial_j2, partial_b1, partial_b2


def alt():
    delta3 = (-(x - a3) * a3 * (1 - a3)).T
    sum1 = np.dot(sparse.theta2.T, delta3)
    delta2 = (sum1 + sum2) * a2.T * (1 - a2.T)
    # delta3: (64, 10000)
    # sum1: (25, 10000)
    # delta2: (25, 10000)
    # a1: (10000, 64)
    # a2: (10000, 25)
    partial_j1 = np.dot(delta2, a1)
    partial_j2 = np.dot(delta3, a2)
    partial_b1 = delta2.sum(axis=1)
    partial_b2 = delta3.sum(axis=1)
    return partial_j1, partial_j2, partial_b1, partial_b2

answer = orig()
result = alt()
for a, r in zip(answer, result):
    try:
        assert np.allclose(np.squeeze(a), r)
    except AssertionError:
        print(a.shape)
        print(r.shape)
        raise


Tip: Notice that I left the shapes of all the intermediate arrays in the comments. Knowing the shapes of the arrays helped me understand what your code was doing. The shapes of the arrays can help guide you toward the right NumPy functions to use. Or at least, paying attention to the shapes can help you know if an operation is sensible. For example, when you compute

np.dot(A, B)

with A.shape = (n, m) and B.shape = (m, p), np.dot(A, B) will be an array of shape (n, p).
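
For instance, a quick sanity check with small made-up dimensions:

import numpy as np

A = np.random.random((3, 4))    # (n, m) = (3, 4)
B = np.random.random((4, 5))    # (m, p) = (4, 5)
assert np.dot(A, B).shape == (3, 5)    # (n, p)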

It can help to build the arrays in C_CONTIGUOUS-order (at least, if using np.dot). There might be as much as a 3x speed up by doing so:

Below, x is the same as xf except that x is C_CONTIGUOUS and xf is F_CONTIGUOUS -- and the same relationship for y and yf.

import numpy as np

m, n, p = 10 ** 4, 64, 25
x = np.random.random((n, m))
xf = np.asarray(x, order='F')

y = np.random.random((m, n))
yf = np.asarray(y, order='F')

assert np.allclose(x, xf)
assert np.allclose(y, yf)
assert np.allclose(np.dot(x, y), np.dot(xf, y))
assert np.allclose(np.dot(x, y), np.dot(xf, yf))

%timeit benchmarks show the difference in speed:

In [50]: %timeit np.dot(x, y)
100 loops, best of 3: 12.9 ms per loop

In [51]: %timeit np.dot(xf, y)
10 loops, best of 3: 27.7 ms per loop

In [56]: %timeit np.dot(x, yf)
10 loops, best of 3: 21.8 ms per loop

In [53]: %timeit np.dot(xf, yf)
10 loops, best of 3: 33.3 ms per loop


Regarding benchmarking in Python:

It can be misleading to use the difference between pairs of time.time() calls to benchmark the speed of code in Python. You need to repeat the measurement many times. It's better to disable the automatic garbage collector. It is also important to measure large spans of time (such as at least 10 seconds' worth of repetitions) to avoid errors due to poor resolution of the clock timer and to reduce the significance of the time.time call overhead. Instead of writing all that code yourself, Python provides you with the timeit module. I'm essentially using that to time the pieces of code, except that I'm calling it through an IPython terminal for convenience.
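
A minimal sketch of using timeit directly (array sizes chosen to match the np.dot benchmark above; note that timeit turns off garbage collection while timing by default):

import timeit

setup = "import numpy as np; x = np.random.random((64, 10000)); y = np.random.random((10000, 64))"

# Each entry in `times` is the total time for `number` executions;
# repeating guards against one-off fluctuations.
times = timeit.repeat("np.dot(x, y)", setup=setup, repeat=3, number=100)
print(min(times) / 100)    # best per-call time, like %timeit's "best of 3"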

I'm not sure if this is affecting your benchmarks, but be aware it could make a difference. In the question I linked to, according to time.time two pieces of code differed by a factor of 1.7x while benchmarks using timeit showed the pieces of code ran in essentially identical amounts of time.
