Matrix multiplication with iterator dependency - NumPy
Some time back, this question (now deleted, but users with 10K+ reputation can still view it) was posted. It looked interesting to me, I learnt something new while trying to solve it, and I thought that was worth sharing. I would like to post those ideas/solutions, and would love to see people post other possible ways to solve it. I am posting the gist of the question next.
So, we have two NumPy ndarrays a and b with shapes:

a : (m,n,N)
b : (n,m,N)

Let's assume we are dealing with cases where m, n and N are comparable.
The problem is to compute the following multiplication and summation, with a focus on performance:
def all_loopy(a,b):
    P,Q,N = a.shape
    d = np.zeros(N)
    for i in range(N):
        for j in range(i):
            for k in range(P):
                for n in range(Q):
                    d[i] += a[k,n,i] * b[n,k,j]
    return d
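As a sanity check on what all_loopy computes, here is a minimal comparison against a per-slice einsum reference (the function is repeated so the snippet runs standalone; the array sizes are arbitrary):

```python
import numpy as np

# all_loopy repeated here so the snippet is self-contained
def all_loopy(a, b):
    P, Q, N = a.shape
    d = np.zeros(N)
    for i in range(N):
        for j in range(i):
            for k in range(P):
                for n in range(Q):
                    d[i] += a[k, n, i] * b[n, k, j]
    return d

# per-element check: d[i] sums the double contraction of
# a[:, :, i] against b[:, :, j] over all j < i
rng = np.random.default_rng(0)
a = rng.random((3, 4, 5))   # (m, n, N)
b = rng.random((4, 3, 5))   # (n, m, N)
d = all_loopy(a, b)
ref = np.array([sum(np.einsum('kn,nk->', a[:, :, i], b[:, :, j])
                    for j in range(i)) for i in range(5)])
assert np.allclose(d, ref)
```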
I learnt a few things along the way while trying to find vectorized and faster ways to solve it.
1) First off, there is a dependency between the iterators at for j in range(i). From my previous experience, especially with trying to solve such problems in MATLAB, it appeared that such a dependency could be taken care of with a lower triangular matrix, so np.tril should work here. Thus, a fully vectorized but not so memory-efficient solution (as it creates an intermediate (N,N) shaped array before finally reducing to an (N,) shaped array) would be -
def fully_vectorized(a,b):
    return np.tril(np.einsum('ijk,jil->kl',a,b),-1).sum(1)
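To confirm the tril trick does what is claimed, here is a minimal check of fully_vectorized against a direct loop (the definition is repeated so the snippet runs standalone; sizes are arbitrary):

```python
import numpy as np

def fully_vectorized(a, b):
    # einsum builds the full (N, N) matrix of pairwise contractions;
    # tril(..., -1) keeps only the strict j < i terms before the row-sum
    return np.tril(np.einsum('ijk,jil->kl', a, b), -1).sum(1)

rng = np.random.default_rng(1)
m, n, N = 3, 4, 5
a = rng.random((m, n, N))
b = rng.random((n, m, N))

# brute-force reference for the same quantity
ref = np.zeros(N)
for i in range(N):
    for j in range(i):
        ref[i] += (a[:, :, i] * b[:, :, j].T).sum()
assert np.allclose(fully_vectorized(a, b), ref)
```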
2) The next trick/idea was to keep one loop for the iterator i in for i in range(N), but insert that dependency with indexing and use np.einsum to perform all those multiplications and summations. The advantage would be memory efficiency. The implementation would look like this -
def einsum_oneloop(a,b):
    N = a.shape[2]
    d = np.zeros(N)
    for i in range(N):
        d[i] = np.einsum('ij,jik->',a[:,:,i],b[:,:,np.arange(i)])
    return d
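Worth noting: b[:, :, np.arange(i)] uses fancy indexing, which makes a copy, while the equivalent basic slice b[:, :, :i] returns a view. A sketch of that variant (einsum_oneloop_slice is an illustrative name, with N taken from a's shape rather than a global), checked against a brute-force loop:

```python
import numpy as np

def einsum_oneloop_slice(a, b):
    N = a.shape[2]            # avoid relying on a global N
    d = np.zeros(N)
    for i in range(N):
        # b[:, :, :i] is a view; b[:, :, np.arange(i)] would copy
        d[i] = np.einsum('ij,jik->', a[:, :, i], b[:, :, :i])
    return d

# quick brute-force check on small random inputs (sizes arbitrary)
rng = np.random.default_rng(0)
a = rng.random((3, 4, 5))
b = rng.random((4, 3, 5))
ref = np.zeros(5)
for i in range(5):
    for j in range(i):
        for k in range(3):
            for n in range(4):
                ref[i] += a[k, n, i] * b[n, k, j]
assert np.allclose(einsum_oneloop_slice(a, b), ref)
```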
There are two more obvious ways to solve it. So, if we start working from the original all_loopy solution, we could keep the outer two loops and use np.einsum or np.tensordot to perform those operations, thus removing the inner two loops, like so -
def tensordot_twoloop(a,b):
    N = a.shape[2]
    d = np.zeros(N)
    for i in range(N):
        for j in range(i):
            d[i] += np.tensordot(a[:,:,i],b[:,:,j], axes=([1,0],[0,1]))
    return d

def einsum_twoloop(a,b):
    N = a.shape[2]
    d = np.zeros(N)
    for i in range(N):
        for j in range(i):
            d[i] += np.einsum('ij,ji->',a[:,:,i],b[:,:,j])
    return d
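As a side note, the double contraction inside both two-loop versions is just the trace of the matrix product of the two slices; a quick check of that equivalence (the variable names here are illustrative):

```python
import numpy as np

# axes=([1,0],[0,1]) contracts x's axis 1 with y's axis 0 and
# x's axis 0 with y's axis 1, i.e. sum_{k,n} x[k,n]*y[n,k],
# which equals trace(x @ y) for 2-D inputs
rng = np.random.default_rng(3)
x = rng.random((3, 4))
y = rng.random((4, 3))
t = np.tensordot(x, y, axes=([1, 0], [0, 1]))
assert np.isclose(t, np.trace(x @ y))
assert np.isclose(t, np.einsum('ij,ji->', x, y))
```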
Runtime test
Let's compare all five approaches posted thus far to solve the problem, including the one posted in the question.
Case #1 :
In [26]: # Input arrays with random elements
...: m,n,N = 20,20,20
...: a = np.random.rand(m,n,N)
...: b = np.random.rand(n,m,N)
...:
In [27]: %timeit all_loopy(a,b)
...: %timeit tensordot_twoloop(a,b)
...: %timeit einsum_twoloop(a,b)
...: %timeit einsum_oneloop(a,b)
...: %timeit fully_vectorized(a,b)
...:
10 loops, best of 3: 79.6 ms per loop
100 loops, best of 3: 4.97 ms per loop
1000 loops, best of 3: 1.66 ms per loop
1000 loops, best of 3: 585 µs per loop
1000 loops, best of 3: 684 µs per loop
Case #2 :
In [28]: # Input arrays with random elements
...: m,n,N = 50,50,50
...: a = np.random.rand(m,n,N)
...: b = np.random.rand(n,m,N)
...:
In [29]: %timeit all_loopy(a,b)
...: %timeit tensordot_twoloop(a,b)
...: %timeit einsum_twoloop(a,b)
...: %timeit einsum_oneloop(a,b)
...: %timeit fully_vectorized(a,b)
...:
1 loops, best of 3: 3.1 s per loop
10 loops, best of 3: 54.1 ms per loop
10 loops, best of 3: 26.2 ms per loop
10 loops, best of 3: 27 ms per loop
10 loops, best of 3: 23.3 ms per loop
Case #3 (Leaving out all_loopy for being very slow) :
In [30]: # Input arrays with random elements
...: m,n,N = 100,100,100
...: a = np.random.rand(m,n,N)
...: b = np.random.rand(n,m,N)
...:
In [31]: %timeit tensordot_twoloop(a,b)
...: %timeit einsum_twoloop(a,b)
...: %timeit einsum_oneloop(a,b)
...: %timeit fully_vectorized(a,b)
...:
1 loops, best of 3: 1.08 s per loop
1 loops, best of 3: 744 ms per loop
1 loops, best of 3: 568 ms per loop
1 loops, best of 3: 866 ms per loop
Going by the numbers, einsum_oneloop looks pretty good to me, whereas fully_vectorized could be used when dealing with small to decently sized arrays!