Why is numpy's einsum faster than numpy's built-in functions?


Question

Let's start with three arrays of dtype=np.double. Timings were performed on an Intel CPU using numpy 1.7.1 compiled with icc and linked to Intel's MKL. An AMD CPU with numpy 1.6.1 compiled with gcc without MKL was also used to verify the timings. Please note that the timings scale nearly linearly with system size and are not due to the small overhead incurred in the numpy functions' if statements; those differences would show up in microseconds, not milliseconds:

arr_1D=np.arange(500,dtype=np.double)
large_arr_1D=np.arange(100000,dtype=np.double)
arr_2D=np.arange(500**2,dtype=np.double).reshape(500,500)
arr_3D=np.arange(500**3,dtype=np.double).reshape(500,500,500)
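
As a side note on the linearity claim above, here is a minimal sketch (my addition, not from the original question) of how it can be checked: time the same reduction at two sizes and confirm that the per-element cost stays roughly constant.

import numpy as np
import timeit

for n in (200, 400):
    a = np.arange(n**3, dtype=np.double).reshape(n, n, n)
    # average of 10 calls; the per-element cost should be nearly flat if
    # the total time is linear in the number of elements
    t = timeit.timeit(lambda: np.einsum('ijk->', a), number=10) / 10
    print('n=%d  %.4f s  (%.2e s/element)' % (n, t, t / a.size))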

First, let's look at the np.sum function:

np.all(np.sum(arr_3D)==np.einsum('ijk->',arr_3D))
True

%timeit np.sum(arr_3D)
10 loops, best of 3: 142 ms per loop

%timeit np.einsum('ijk->', arr_3D)
10 loops, best of 3: 70.2 ms per loop

Powers:

np.allclose(arr_3D*arr_3D*arr_3D,np.einsum('ijk,ijk,ijk->ijk',arr_3D,arr_3D,arr_3D))
True

%timeit arr_3D*arr_3D*arr_3D
1 loops, best of 3: 1.32 s per loop

%timeit np.einsum('ijk,ijk,ijk->ijk', arr_3D, arr_3D, arr_3D)
1 loops, best of 3: 694 ms per loop

An outer product:

np.all(np.outer(arr_1D,arr_1D)==np.einsum('i,k->ik',arr_1D,arr_1D))
True

%timeit np.outer(arr_1D, arr_1D)
1000 loops, best of 3: 411 us per loop

%timeit np.einsum('i,k->ik', arr_1D, arr_1D)
1000 loops, best of 3: 245 us per loop

All of the above are roughly twice as fast with np.einsum. These should be apples-to-apples comparisons, as everything is specifically of dtype=np.double. I would expect the speedup in an operation like this:

np.allclose(np.sum(arr_2D*arr_3D),np.einsum('ij,oij->',arr_2D,arr_3D))
True

%timeit np.sum(arr_2D*arr_3D)
1 loops, best of 3: 813 ms per loop

%timeit np.einsum('ij,oij->', arr_2D, arr_3D)
10 loops, best of 3: 85.1 ms per loop

Einsum seems to be at least twice as fast for np.inner, np.outer, np.kron, and np.sum regardless of the axes selection. The primary exception is np.dot, as it calls DGEMM from a BLAS library. So why is np.einsum faster than the other numpy functions that are equivalent?
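
For reference, a minimal sketch (my addition, not from the original question) of the einsum expressions being treated as equivalent to the built-ins named above, checked on small inputs:

import numpy as np

a = np.arange(500, dtype=np.double)
b = np.arange(500, dtype=np.double)
M = np.arange(500**2, dtype=np.double).reshape(500, 500)

# inner product, outer product, Kronecker product of 1-D vectors,
# and axis-wise sums expressed as einsum calls
assert np.allclose(np.inner(a, b),    np.einsum('i,i->', a, b))
assert np.allclose(np.outer(a, b),    np.einsum('i,j->ij', a, b))
assert np.allclose(np.kron(a, b),     np.einsum('i,j->ij', a, b).ravel())
assert np.allclose(np.sum(M, axis=0), np.einsum('ij->j', M))
assert np.allclose(np.sum(M, axis=1), np.einsum('ij->i', M))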

The DGEMM case for completeness:

np.allclose(np.dot(arr_2D,arr_2D),np.einsum('ij,jk',arr_2D,arr_2D))
True

%timeit np.einsum('ij,jk',arr_2D,arr_2D)
10 loops, best of 3: 56.1 ms per loop

%timeit np.dot(arr_2D,arr_2D)
100 loops, best of 3: 5.17 ms per loop


The leading theory is from @seberg's comment that np.einsum can make use of SSE2, but numpy's ufuncs will not until numpy 1.8 (see the change log). I believe this is the correct answer, but have not been able to confirm it. Some limited proof can be found by changing the dtype of the input arrays and observing the speed difference, and by the fact that not everyone observes the same trends in timings.
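
A minimal sketch (my addition, not from the original question) of the kind of dtype probe meant here: if SIMD vectorization is the cause, the gap between np.sum and np.einsum should change noticeably with the element type.

import numpy as np
import timeit

for dtype in (np.float64, np.float32, np.int64):
    a = np.arange(200**3).astype(dtype).reshape(200, 200, 200)
    # average of 10 calls for each dtype; compare the two columns
    t_sum = timeit.timeit(lambda: np.sum(a), number=10) / 10
    t_ein = timeit.timeit(lambda: np.einsum('ijk->', a), number=10) / 10
    print('%-8s sum: %.4f s   einsum: %.4f s' % (np.dtype(dtype).name, t_sum, t_ein))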

Answer

Now that numpy 1.8 is released, where according to the docs all ufuncs should use SSE2, I wanted to double-check that Seberg's comment about SSE2 was valid.

To perform the test, a new python 2.7 install was created: numpy 1.7 and 1.8 were compiled with icc using standard options on an AMD Opteron core running Ubuntu.

This is the test run both before and after the 1.8 upgrade:

import numpy as np
import timeit

arr_1D=np.arange(5000,dtype=np.double)
arr_2D=np.arange(500**2,dtype=np.double).reshape(500,500)
arr_3D=np.arange(500**3,dtype=np.double).reshape(500,500,500)

print 'Summation test:'
print timeit.timeit('np.sum(arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("ijk->", arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'


print 'Power test:'
print timeit.timeit('arr_3D*arr_3D*arr_3D',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("ijk,ijk,ijk->ijk", arr_3D, arr_3D, arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'


print 'Outer test:'
print timeit.timeit('np.outer(arr_1D, arr_1D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("i,k->ik", arr_1D, arr_1D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'


print 'Einsum test:'
print timeit.timeit('np.sum(arr_2D*arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("ij,oij->", arr_2D, arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'

Numpy 1.7.1:

Summation test:
0.172988510132
0.0934836149216
----------------------

Power test:
1.93524689674
0.839519000053
----------------------

Outer test:
0.130380821228
0.121401786804
----------------------

Einsum test:
0.979052495956
0.126066613197

Numpy 1.8:

Summation test:
0.116551589966
0.0920487880707
----------------------

Power test:
1.23683619499
0.815982818604
----------------------

Outer test:
0.131808176041
0.127472200394
----------------------

Einsum test:
0.781750011444
0.129271841049

I think this is fairly conclusive that SSE plays a large role in the timing differences. It should be noted that, on repeating these tests, the timings vary by only ~0.003 s. The remaining difference should be covered in the other answers to this question.
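
As a minimal sketch (my addition, not part of the original answer), the quoted run-to-run spread of ~0.003 s can be checked with timeit.repeat rather than a single number:

import numpy as np
import timeit

arr_3D = np.arange(500**3, dtype=np.double).reshape(500, 500, 500)

# five repeats of five calls each; the spread across repeats gives a feel
# for the measurement noise behind the numbers above
reps = timeit.repeat('np.einsum("ijk->", arr_3D)',
                     'import numpy as np; from __main__ import arr_3D',
                     repeat=5, number=5)
per_call = [r / 5 for r in reps]
print('spread: %.4f s' % (max(per_call) - min(per_call)))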
