为什么numpy的的einsum速度比numpy的内置的功能呢? [英] Why is numpy's einsum faster than numpy's built in functions?

查看:284
本文介绍了为什么numpy的的einsum速度比numpy的内置的功能呢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

让我们开始与 DTYPE = np.double 的三个数组。时序使用numpy的1.7.1与编译ICC A Intel的CPU上执行并与英特尔的 MKL 。一个AMD的CPU与numpy的1.6.1与 GCC 编译没有 MKL 也被用来验证的时序。请注意,定时与系统规模的增加几乎呈线性,并没有由于numpy的职能发生了小的开销如果语句,这些差异在微秒不会毫秒显​​示:

Lets start with three arrays of dtype=np.double. Timings are performed on a intel CPU using numpy 1.7.1 compiled with icc and linked to intel's mkl. A AMD cpu with numpy 1.6.1 compiled with gcc without mkl was also used to verify the timings. Please note the timings scale nearly linearly with system size and are not due to the small overhead incurred in the numpy functions if statements these difference will show up in microseconds not milliseconds:

arr_1D=np.arange(500,dtype=np.double)
large_arr_1D=np.arange(100000,dtype=np.double)
arr_2D=np.arange(500**2,dtype=np.double).reshape(500,500)
arr_3D=np.arange(500**3,dtype=np.double).reshape(500,500,500)

首先让我们看看 np.sum 功能:

np.all(np.sum(arr_3D)==np.einsum('ijk->',arr_3D))
True

%timeit np.sum(arr_3D)
10 loops, best of 3: 142 ms per loop

%timeit np.einsum('ijk->', arr_3D)
10 loops, best of 3: 70.2 ms per loop

权力:

np.allclose(arr_3D*arr_3D*arr_3D,np.einsum('ijk,ijk,ijk->ijk',arr_3D,arr_3D,arr_3D))
True

%timeit arr_3D*arr_3D*arr_3D
1 loops, best of 3: 1.32 s per loop

%timeit np.einsum('ijk,ijk,ijk->ijk', arr_3D, arr_3D, arr_3D)
1 loops, best of 3: 694 ms per loop

外积:

np.all(np.outer(arr_1D,arr_1D)==np.einsum('i,k->ik',arr_1D,arr_1D))
True

%timeit np.outer(arr_1D, arr_1D)
1000 loops, best of 3: 411 us per loop

%timeit np.einsum('i,k->ik', arr_1D, arr_1D)
1000 loops, best of 3: 245 us per loop

以上所有的都是快两倍 np.einsum 。这些应该是苹果,苹果的比较作为一切都是专门 DTYPE = np.double 。我希望在操作速度像这样:

All of the above are twice as fast with np.einsum. These should be apples to apples comparisons as everything is specifically of dtype=np.double. I would expect the speed up in an operation like this:

np.allclose(np.sum(arr_2D*arr_3D),np.einsum('ij,oij->',arr_2D,arr_3D))
True

%timeit np.sum(arr_2D*arr_3D)
1 loops, best of 3: 813 ms per loop

%timeit np.einsum('ij,oij->', arr_2D, arr_3D)
10 loops, best of 3: 85.1 ms per loop

Einsum似乎至少快两倍,为 np.inner np.outer np.kron np.sum 无论选择。主要的例外是 np.dot ,因为它从一个BLAS库调用DGEMM。那么,为什么是 np.einsum 更快,其他功能numpy的是等效的?

Einsum seems to be at least twice as fast for np.inner, np.outer, np.kron, and np.sum regardless of axes selection. The primary exception being np.dot as it calls DGEMM from a BLAS library. So why is np.einsum faster that other numpy functions that are equivalent?

该DGEMM案例完整性:

The DGEMM case for completeness:

np.allclose(np.dot(arr_2D,arr_2D),np.einsum('ij,jk',arr_2D,arr_2D))
True

%timeit np.einsum('ij,jk',arr_2D,arr_2D)
10 loops, best of 3: 56.1 ms per loop

%timeit np.dot(arr_2D,arr_2D)
100 loops, best of 3: 5.17 ms per loop


主导理论是从@sebergs评论说 np.einsum 可利用的 SSE2 ,但numpy的的ufuncs不会直到numpy的1.8(见的更改日志)。我相信这是正确的答案,但有的的能够证实它。一些有限的证据可以通过改变输入数组的DTYPE和观察的速度差异,并不是每个人都遵守相同的趋势时序的事实被发现。


The leading theory is from @sebergs comment that np.einsum can make use of SSE2, but numpy's ufuncs will not until numpy 1.8 (see the change log). I believe this is the correct answer, but have not been able to confirm it. Some limited proof can be found by changing the dtype of input array and observing speed difference and the fact that not everyone observes the same trends in timings.

推荐答案

现在,numpy的1.8被释放,其中根据文档的所有ufuncs应该使用SSE2,我想仔细检查Seberg的约SSE2评论是有效的。

Now that numpy 1.8 is released, where according to the docs all ufuncs should use SSE2, I wanted to double check that Seberg's comment about SSE2 was valid.

要执行测试一个新的Python 2.7安装在created- numpy的1.7和1.8,用使用上运行Ubuntu一个AMD皓龙核心的标准选项ICC 编译。

To perform the test a new python 2.7 install was created- numpy 1.7 and 1.8 were compiled with icc using standard options on a AMD opteron core running Ubuntu.

这是试运行前和1.8升级后:

This is the test run both before and after the 1.8 upgrade:

import numpy as np
import timeit

arr_1D=np.arange(5000,dtype=np.double)
arr_2D=np.arange(500**2,dtype=np.double).reshape(500,500)
arr_3D=np.arange(500**3,dtype=np.double).reshape(500,500,500)

print 'Summation test:'
print timeit.timeit('np.sum(arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("ijk->", arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'


print 'Power test:'
print timeit.timeit('arr_3D*arr_3D*arr_3D',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("ijk,ijk,ijk->ijk", arr_3D, arr_3D, arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'


print 'Outer test:'
print timeit.timeit('np.outer(arr_1D, arr_1D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("i,k->ik", arr_1D, arr_1D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'


print 'Einsum test:'
print timeit.timeit('np.sum(arr_2D*arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("ij,oij->", arr_2D, arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'

numpy的1.7.1:

Numpy 1.7.1:

Summation test:
0.172988510132
0.0934836149216
----------------------

Power test:
1.93524689674
0.839519000053
----------------------

Outer test:
0.130380821228
0.121401786804
----------------------

Einsum test:
0.979052495956
0.126066613197

numpy的1.8:

Numpy 1.8:

Summation test:
0.116551589966
0.0920487880707
----------------------

Power test:
1.23683619499
0.815982818604
----------------------

Outer test:
0.131808176041
0.127472200394
----------------------

Einsum test:
0.781750011444
0.129271841049

我觉得这是相当确凿的上证中扮演着时间性差异很大的作用,但应注意的是,仅〜0.003s重复这些测试时序非常。剩下的差额应在其他回答这个问题被覆盖。

I think this is fairly conclusive that SSE plays a large role in the timing differences, it should be noted that repeating these tests the timings very by only ~0.003s. The remaining difference should be covered in the other answers to this question.

这篇关于为什么numpy的的einsum速度比numpy的内置的功能呢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆