Why is numpy's einsum slower than numpy's built-in functions?


Question

I've usually gotten good performance out of numpy's einsum function (and I like its syntax). @Ophion's answer to this question shows that, for the cases tested, einsum consistently outperforms the "built-in" functions (sometimes by a little, sometimes by a lot). But I just encountered a case where einsum is much slower. Consider the following equivalent functions:

import numpy as np

(M, K) = (1000000, 20)
C = np.random.rand(K, K)
X = np.random.rand(M, K)

def func_dot(C, X):
    Y = X.dot(C)
    return np.sum(Y * X, axis=1)

def func_einsum(C, X):
    return np.einsum('ik,km,im->i', X, C, X)

def func_einsum2(C, X):
    # Like func_einsum but break it into two steps.
    A = np.einsum('ik,km', X, C)
    return np.einsum('ik,ik->i', A, X)
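As a sanity check that the three variants really are equivalent, here is a small sketch with a reduced M so it runs quickly:

```python
import numpy as np

M, K = 1000, 20
C = np.random.rand(K, K)
X = np.random.rand(M, K)

def func_dot(C, X):
    Y = X.dot(C)
    return np.sum(Y * X, axis=1)

def func_einsum(C, X):
    return np.einsum('ik,km,im->i', X, C, X)

def func_einsum2(C, X):
    # Like func_einsum but broken into two steps.
    A = np.einsum('ik,km', X, C)
    return np.einsum('ik,ik->i', A, X)

y1, y2, y3 = func_dot(C, X), func_einsum(C, X), func_einsum2(C, X)
assert np.allclose(y1, y2) and np.allclose(y1, y3)
```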

I expected func_einsum to run fastest, but that is not what I encounter. Running on a quad-core CPU with hyperthreading, numpy version 1.9.0.dev-7ae0206, and multithreading with OpenBLAS, I get the following results:

In [2]: %time y1 = func_dot(C, X)
CPU times: user 320 ms, sys: 312 ms, total: 632 ms
Wall time: 209 ms
In [3]: %time y2 = func_einsum(C, X)
CPU times: user 844 ms, sys: 0 ns, total: 844 ms
Wall time: 842 ms
In [4]: %time y3 = func_einsum2(C, X)
CPU times: user 292 ms, sys: 44 ms, total: 336 ms
Wall time: 334 ms

When I increase K to 200, the differences are more extreme:

In [2]: %time y1= func_dot(C, X)
CPU times: user 4.5 s, sys: 1.02 s, total: 5.52 s
Wall time: 2.3 s
In [3]: %time y2= func_einsum(C, X)
CPU times: user 1min 16s, sys: 44 ms, total: 1min 16s
Wall time: 1min 16s
In [4]: %time y3 = func_einsum2(C, X)
CPU times: user 15.3 s, sys: 312 ms, total: 15.6 s
Wall time: 15.6 s

Can someone explain why einsum is so much slower here?

If it matters, here is my numpy config:

In [6]: np.show_config()
lapack_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
atlas_threads_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_WITHOUT_LAPACK', None)]
    language = c
    include_dirs = ['/usr/local/include']
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = c
    include_dirs = ['/usr/local/include']
atlas_blas_threads_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = c
    include_dirs = ['/usr/local/include']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_WITHOUT_LAPACK', None)]
    language = f77
    include_dirs = ['/usr/local/include']
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

Answer

You can have the best of both worlds:

def func_dot_einsum(C, X):
    Y = X.dot(C)
    return np.einsum('ij,ij->i', Y, X)
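A minimal sketch (with a smaller M so it runs quickly) confirming that this combined version agrees with the plain dot-based one:

```python
import numpy as np

M, K = 1000, 20
C = np.random.rand(K, K)
X = np.random.rand(M, K)

def func_dot(C, X):
    Y = X.dot(C)
    return np.sum(Y * X, axis=1)  # materializes Y * X, then reduces

def func_dot_einsum(C, X):
    Y = X.dot(C)                        # BLAS-backed matrix product
    return np.einsum('ij,ij->i', Y, X)  # row-wise dot, no intermediate product array

assert np.allclose(func_dot(C, X), func_dot_einsum(C, X))
```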

On my system:

In [7]: %timeit func_dot(C, X)
10 loops, best of 3: 31.1 ms per loop

In [8]: %timeit func_einsum(C, X)
10 loops, best of 3: 105 ms per loop

In [9]: %timeit func_einsum2(C, X)
10 loops, best of 3: 43.5 ms per loop

In [10]: %timeit func_dot_einsum(C, X)
10 loops, best of 3: 21 ms per loop

When available, np.dot uses BLAS, MKL, or whatever optimized library you have. So the call to np.dot is almost certainly multithreaded. np.einsum has its own loops, so it doesn't use any of those optimizations, apart from its own use of SIMD to speed things up over a vanilla C implementation.

Then there's the multi-input einsum call that runs much slower... The numpy source for einsum is very complex and I don't fully understand it. So be advised that the following is speculative at best, but here's what I think is going on...

When you run something like np.einsum('ij,ij->i', a, b), the benefit over doing np.sum(a*b, axis=1) comes from avoiding having to instantiate the intermediate array with all the products, and looping twice over it. So at the low level what goes on is something like:

for i in range(I):
    out[i] = 0
    for j in range(J):
        out[i] += a[i, j] * b[i, j]
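The loop above can be checked directly against einsum with a small sketch (array sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 50, 40
a = rng.random((I, J))
b = rng.random((I, J))

# The low-level pattern sketched above, written out in Python.
out = np.zeros(I)
for i in range(I):
    for j in range(J):
        out[i] += a[i, j] * b[i, j]

assert np.allclose(out, np.einsum('ij,ij->i', a, b))
assert np.allclose(out, np.sum(a * b, axis=1))
```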

Say now that you are after something like:

np.einsum('ij,jk,ik->i', a, b, c)

You could also compute this as:

np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))

And what I think einsum does is to run this last code without having to instantiate the huge intermediate array, which certainly makes things faster:

In [29]: a, b, c = np.random.rand(3, 100, 100)

In [30]: %timeit np.einsum('ij,jk,ik->i', a, b, c)
100 loops, best of 3: 2.41 ms per loop

In [31]: %timeit np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))
100 loops, best of 3: 12.3 ms per loop

But if you look at it carefully, getting rid of intermediate storage can be a terrible thing. This is what I think einsum is doing at the low level:

for i in range(I):
    out[i] = 0
    for j in range(J):
        for k in range(K):
            out[i] += a[i, j] * b[j, k] * c[i, k]

But you are repeating a ton of operations! If you instead did:

for i in range(I):
    out[i] = 0
    for j in range(J):
        temp = 0
        for k in range(K):
            temp += b[j, k] * c[i, k]
        out[i] += a[i, j] * temp
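To see that the factored loop still computes the same contraction, here is a small sketch (sizes arbitrary); this factoring mirrors what the two-step func_einsum2 approach achieves with vectorized operations:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K = 12, 10, 8
a = rng.random((I, J))
b = rng.random((J, K))
c = rng.random((I, K))

# Factored loop: reduce over k first, then reuse the partial result.
out = np.zeros(I)
for i in range(I):
    for j in range(J):
        temp = 0.0
        for k in range(K):
            temp += b[j, k] * c[i, k]
        out[i] += a[i, j] * temp

assert np.allclose(out, np.einsum('ij,jk,ik->i', a, b, c))
```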

you would be doing I * J * (K-1) fewer multiplications (plus I * J extra additions), and save yourself a ton of time. My guess is that einsum is not smart enough to optimize things at this level. In the source code there is a hint that it only optimizes operations with 1 or 2 operands, not 3. In any case, automating this for general inputs seems like anything but simple...
