numpy elementwise outer product with sparse matrices


Problem description


I want to take the element-wise outer product of three (or four) large 2D arrays in Python (values are float32 rounded to 2 decimals). They all have the same number of rows "n", but different numbers of columns "i", "j", "k".
The resulting array should be of shape (n, i*j*k). Then, I want to sum each column of the result to end up with a 1D array of shape (i*j*k).

np.shape(a) = (75466, 10)
np.shape(b) = (75466, 28)
np.shape(c) = (75466, 66)

np.shape(intermediate_result) = (75466, 18480)
np.shape(result) = (18480,)
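
As a tiny, self-contained illustration of the operation (made-up 2-row arrays, with two matrices instead of three):

import numpy as np

# The row-wise outer product of shapes (2, 2) and (2, 3) gives (2, 2*3);
# summing over axis 0 then gives a 1D array of length 6.
x = np.array([[1., 2.], [3., 4.]])
y = np.array([[5., 6., 7.], [8., 9., 10.]])
intermediate = (x[:, :, None] * y[:, None, :]).reshape(2, -1)  # shape (2, 6)
result = intermediate.sum(axis=0)                              # shape (6,)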


Thanks to ruankesi and divakar, I got a piece of code that works:

import numpy as np

# Multiply the first two matrices (row-wise outer product)
first_multi = a[..., None] * b[:, None]
# could use np.einsum('ij,ik->ijk', a, b), which is slightly faster
ab_fills = first_multi.reshape(a.shape[0], a.shape[1] * b.shape[1])

# Multiply the result with the third matrix
second_multi = ab_fills[..., None] * c[:, None]
abc_fills = second_multi.reshape(ab_fills.shape[0], ab_fills.shape[1] * c.shape[1])

# Get the result: sum each column (reduce over axis 0) to get a 1D array
# of length 10*28*66 = 18480
result = np.sum(abc_fills, axis=0)


Problem 1: Performance

This takes about 3 seconds, but I have to repeat this operation many times, and some of the matrices are even larger (in number of rows). It is acceptable, but making it faster would be nice.


Indeed, for instance, "a" contains 70% 0s. I tried to play with scipy csc_matrix, but really could not get a working version. (To get the element-wise outer product here, I go via a conversion to a 3D array, which scipy sparse matrices do not support.)
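
For reference, a minimal sketch (an illustration of one possible workaround, not the original attempt) of how the zeros in "a" could be exploited with scipy.sparse without any 3D sparse array: since result[i, j, k] = sum over n of a[n, i]*b[n, j]*c[n, k], the reduction over rows is a matrix product between a.T and the row-wise outer product of b and c, so only "a" needs to be sparse. The name sparse_first_factor is made up; the b-c outer product stays dense and remains the memory bottleneck.

import numpy as np
from scipy import sparse

def sparse_first_factor(a, b, c):
    # Reduction over rows as a.T (i, n) @ bc (n, j*k); the sparse format
    # skips the zero entries of a during the product.
    n = a.shape[0]
    bc = (b[:, :, None] * c[:, None, :]).reshape(n, -1)  # dense, (n, j*k)
    a_sp = sparse.csr_matrix(a)
    return (a_sp.T @ bc).ravel()  # (i*j*k,), same ordering as the dense code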

If I try to also work with a 4th matrix, I run into memory issues.


I imagine that converting this code to sparse_matrix would save a lot of memory, and make the calculation faster by ignoring the numerous 0 values. Is that true? If yes, can someone help me?
Of course, if you have any suggestion for a better implementation, I am also very interested. I don't need any of the intermediate results, just the final 1D result.
I've been stuck on this part of the code for weeks, and I am going nuts!

Thank you!




Approach #1:
Very nice one-liner, but surprisingly slower than the original approach (?).
On my test dataset, approach #1 takes 4.98 s ± 3.06 ms per loop (no speedup with optimize=True).
The original decomposed approach took 3.01 s ± 16.5 ms per loop.




Approach #2:
Absolutely great, thank you! What an impressive speedup!
62.6 ms ± 233 µs per loop




About numexpr: I try to avoid requirements on external modules as much as possible, and I don't plan to use multiple cores/threads. The task is "embarrassingly" parallelizable; with hundreds of thousands of objects to analyze, I'll just spread the list across the available CPUs in production. I will give it a try for memory optimization.
As a brief test of numexpr restricted to 1 thread, performing a single multiplication, I get a runtime of 40 ms without numexpr and 52 ms with it.
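
For reference, a minimal sketch (illustrative, not from the original post) of how numexpr can be pinned to a single thread for such a like-for-like test:

import numexpr as ne

# Cap numexpr at one thread for a fair comparison with single-threaded
# NumPy; set_num_threads returns the previous thread count.
ne.set_num_threads(1)
first_multi = ne.evaluate('A*B', {'A': a[..., None], 'B': b[:, None]})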
Thanks again!!

Recommended answer

Approach #1


We can use np.einsum to do sum-reductions in one go -

result = np.einsum('ij,ik,il->jkl', a, b, c).ravel()


Also, play around with the optimize flag in np.einsum by setting it to True to use BLAS.
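
For instance:

result = np.einsum('ij,ik,il->jkl', a, b, c, optimize=True).ravel()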

Approach #2


We can use broadcasting to do the first step, as also mentioned in the posted code, and then leverage tensor-matrix multiplication with np.tensordot -

def broadcast_dot(a, b, c):
    # first_multi has shape (n, i, j); tensordot then sums over axis 0
    # (the shared rows) against c, yielding an (i, j, k) tensor.
    first_multi = a[..., None] * b[:, None]
    return np.tensordot(first_multi, c, axes=(0, 0)).ravel()
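
As a quick sanity check (an illustrative addition), the two approaches should agree up to floating-point tolerance:

assert np.allclose(np.einsum('ij,ik,il->jkl', a, b, c).ravel(),
                   broadcast_dot(a, b, c))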


We can also use the numexpr module, which supports multi-core processing and is more memory-efficient, to get first_multi. This gives us a modified solution, like so -

import numexpr as ne

def numexpr_broadcast_dot(a, b, c):
    # Same as broadcast_dot, but the broadcasted multiply is evaluated by
    # numexpr (multi-threaded, with better memory behavior).
    first_multi = ne.evaluate('A*B', {'A': a[..., None], 'B': b[:, None]})
    return np.tensordot(first_multi, c, axes=(0, 0)).ravel()


Timings on random float data with given dataset sizes -

In [36]: %timeit np.einsum('ij,ik,il->jkl',a,b,c).ravel()
4.57 s ± 75.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit broadcast_dot(a,b,c)
270 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit numexpr_broadcast_dot(a,b,c)
172 ms ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Just to give a sense of improvement with numexpr -

In [7]: %timeit a[...,None] * b[:,None]
80.4 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit ne.evaluate('A*B',{'A':a[...,None],'B':b[:,None]})
25.9 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


This improvement should be substantial when extending the solution to a higher number of inputs.

