Why is numpy's einsum faster than numpy's built in functions?
Question
Let's start with three arrays of dtype=np.double. Timings are performed on an Intel CPU using numpy 1.7.1 compiled with icc and linked to Intel's mkl. An AMD CPU with numpy 1.6.1 compiled with gcc without mkl was also used to verify the timings. Please note that the timings scale nearly linearly with system size and are not due to the small overhead incurred in the numpy functions' if statements; those differences would show up in microseconds, not milliseconds:
arr_1D=np.arange(500,dtype=np.double)
large_arr_1D=np.arange(100000,dtype=np.double)
arr_2D=np.arange(500**2,dtype=np.double).reshape(500,500)
arr_3D=np.arange(500**3,dtype=np.double).reshape(500,500,500)
First, let's look at the np.sum function:
np.all(np.sum(arr_3D)==np.einsum('ijk->',arr_3D))
True
%timeit np.sum(arr_3D)
10 loops, best of 3: 142 ms per loop
%timeit np.einsum('ijk->', arr_3D)
10 loops, best of 3: 70.2 ms per loop
Powers:
np.allclose(arr_3D*arr_3D*arr_3D,np.einsum('ijk,ijk,ijk->ijk',arr_3D,arr_3D,arr_3D))
True
%timeit arr_3D*arr_3D*arr_3D
1 loops, best of 3: 1.32 s per loop
%timeit np.einsum('ijk,ijk,ijk->ijk', arr_3D, arr_3D, arr_3D)
1 loops, best of 3: 694 ms per loop
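One plausible explanation for this particular gap (an assumption on my part, not something confirmed in the question) is that the chained expression `arr_3D*arr_3D*arr_3D` materializes a full-size temporary for the first product before the second multiply, while `np.einsum` can compute each output element in a single fused pass. A minimal sketch checking that the two forms agree:

```python
import numpy as np

a = np.arange(4**3, dtype=np.double).reshape(4, 4, 4)

# Chained multiplication: (a * a) allocates a full-size temporary,
# which is then multiplied by a again in a second pass.
chained = a * a * a

# einsum form: the three factors are combined per output element,
# so no intermediate array of the full size is needed.
fused = np.einsum('ijk,ijk,ijk->ijk', a, a, a)

print(np.allclose(chained, fused))  # → True
```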
Outer product:
np.all(np.outer(arr_1D,arr_1D)==np.einsum('i,k->ik',arr_1D,arr_1D))
True
%timeit np.outer(arr_1D, arr_1D)
1000 loops, best of 3: 411 us per loop
%timeit np.einsum('i,k->ik', arr_1D, arr_1D)
1000 loops, best of 3: 245 us per loop
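For context, the einsum spelling of the outer product is equivalent to both `np.outer` and plain broadcasting; a quick sketch of the three forms:

```python
import numpy as np

a = np.arange(5, dtype=np.double)
b = np.arange(5, dtype=np.double)

# Three equivalent ways of forming the outer product a_i * b_k.
via_outer = np.outer(a, b)
via_einsum = np.einsum('i,k->ik', a, b)
via_broadcast = a[:, None] * b[None, :]

print(np.allclose(via_outer, via_einsum))     # → True
print(np.allclose(via_outer, via_broadcast))  # → True
```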
All of the above are twice as fast with np.einsum. These should be apples-to-apples comparisons, as everything is specifically of dtype=np.double. I would expect a speedup in an operation like this:
np.allclose(np.sum(arr_2D*arr_3D),np.einsum('ij,oij->',arr_2D,arr_3D))
True
%timeit np.sum(arr_2D*arr_3D)
1 loops, best of 3: 813 ms per loop
%timeit np.einsum('ij,oij->', arr_2D, arr_3D)
10 loops, best of 3: 85.1 ms per loop
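The much larger gap here is likely (my reading, not stated in the question) because `np.sum(arr_2D*arr_3D)` must first allocate and fill a full 500×500×500 temporary (roughly 1 GB of doubles) before reducing it, whereas the einsum contraction fuses the broadcasted multiply and the reduction into one pass. A small-scale sketch of the equivalence:

```python
import numpy as np

m = np.arange(3**2, dtype=np.double).reshape(3, 3)
t = np.arange(3**3, dtype=np.double).reshape(3, 3, 3)

# Two-step version: builds a full 3x3x3 temporary, then reduces it.
two_step = np.sum(m * t)

# Fused version: 'o' labels the leading axis of t, m is broadcast over
# it, and everything is summed in the same pass - no full temporary.
fused = np.einsum('ij,oij->', m, t)

print(np.isclose(two_step, fused))  # → True
```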
Einsum seems to be at least twice as fast for np.inner, np.outer, np.kron, and np.sum, regardless of axes selection. The primary exception is np.dot, as it calls DGEMM from a BLAS library. So why is np.einsum faster than the other numpy functions that are equivalent?
The DGEMM case for completeness:
np.allclose(np.dot(arr_2D,arr_2D),np.einsum('ij,jk',arr_2D,arr_2D))
True
%timeit np.einsum('ij,jk',arr_2D,arr_2D)
10 loops, best of 3: 56.1 ms per loop
%timeit np.dot(arr_2D,arr_2D)
100 loops, best of 3: 5.17 ms per loop
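As an aside not available in the numpy versions timed above: later releases (numpy 1.12+) added an `optimize` flag to `np.einsum` that can reorganize a contraction and dispatch it to BLAS routines such as DGEMM, narrowing exactly this gap. A hedged sketch of the equivalence check:

```python
import numpy as np

a = np.arange(6**2, dtype=np.double).reshape(6, 6)

direct = np.dot(a, a)

# Plain einsum walks the contraction loops itself; with optimize=True
# (numpy >= 1.12) it may rewrite the contraction to use BLAS instead.
plain = np.einsum('ij,jk', a, a)
optimized = np.einsum('ij,jk', a, a, optimize=True)

print(np.allclose(direct, plain))      # → True
print(np.allclose(direct, optimized))  # → True
```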
---
The leading theory, from @seberg's comment, is that np.einsum can make use of SSE2, but numpy's ufuncs will not until numpy 1.8 (see the change log). I believe this is the correct answer, but have not been able to confirm it. Some limited proof can be found by changing the dtype of the input arrays and observing the speed differences, and by the fact that not everyone observes the same trends in timings.
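The dtype experiment mentioned above can be sketched roughly as follows (a probe I am assuming resembles it, not the exact test run; the array size here is arbitrary). SSE2 operates on packed elements, so if vectorization is the cause, narrower dtypes that fit more lanes per register should shift the sum/einsum ratio:

```python
import numpy as np
import timeit

# Rough probe: time np.sum against the einsum reduction for two dtypes.
# If SIMD use differs between the two code paths, the ratio should move
# with the element width (200**3 elements is a hypothetical size).
for dtype in (np.float32, np.float64):
    arr = np.arange(200**3, dtype=dtype).reshape(200, 200, 200)
    t_sum = timeit.timeit(lambda: np.sum(arr), number=10)
    t_ein = timeit.timeit(lambda: np.einsum('ijk->', arr), number=10)
    print(dtype.__name__, 'sum/einsum ratio: %.2f' % (t_sum / t_ein))
```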
Answer
Now that numpy 1.8 is released, where according to the docs all ufuncs should use SSE2, I wanted to double check that Seberg's comment about SSE2 was valid.
To perform the test, a new python 2.7 install was created - numpy 1.7 and 1.8 were compiled with icc using standard options on an AMD Opteron core running Ubuntu.
This is the test run both before and after the 1.8 upgrade:
import numpy as np
import timeit

arr_1D=np.arange(5000,dtype=np.double)
arr_2D=np.arange(500**2,dtype=np.double).reshape(500,500)
arr_3D=np.arange(500**3,dtype=np.double).reshape(500,500,500)

print 'Summation test:'
print timeit.timeit('np.sum(arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print timeit.timeit('np.einsum("ijk->", arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print '----------------------\n'

print 'Power test:'
print timeit.timeit('arr_3D*arr_3D*arr_3D',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print timeit.timeit('np.einsum("ijk,ijk,ijk->ijk", arr_3D, arr_3D, arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print '----------------------\n'

print 'Outer test:'
print timeit.timeit('np.outer(arr_1D, arr_1D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print timeit.timeit('np.einsum("i,k->ik", arr_1D, arr_1D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print '----------------------\n'

print 'Einsum test:'
print timeit.timeit('np.sum(arr_2D*arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print timeit.timeit('np.einsum("ij,oij->", arr_2D, arr_3D)',
                    'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                    number=5)/5
print '----------------------\n'
Numpy 1.7.1:
Summation test:
0.172988510132
0.0934836149216
----------------------
Power test:
1.93524689674
0.839519000053
----------------------
Outer test:
0.130380821228
0.121401786804
----------------------
Einsum test:
0.979052495956
0.126066613197
Numpy 1.8:
Summation test:
0.116551589966
0.0920487880707
----------------------
Power test:
1.23683619499
0.815982818604
----------------------
Outer test:
0.131808176041
0.127472200394
----------------------
Einsum test:
0.781750011444
0.129271841049
I think this fairly conclusively shows that SSE plays a large role in the timing differences. It should be noted that, on repeating these tests, the timings vary by only ~0.003 s. The remaining difference should be covered in the other answers to this question.