Sparse Matrix vs Dense Matrix Multiplication C++ TensorFlow

Problem description

I would like to write sparse matrix dense vector (SPMv) multiplication in C++ TensorFlow: y = Ax.

The sparse matrix, A, is stored in CSR format. The usual sparsity of A is between 50% and 90%. The goal is to reach a time better than or similar to that of dense matrix dense vector (DMv) multiplication.
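
For reference, a scalar CSR SpMV kernel takes only a few lines of C++. The sketch below is a minimal standalone version; the CsrMatrix struct and its field names are illustrative, not TensorFlow's internal representation:

    #include <cstddef>
    #include <vector>

    // Minimal CSR storage: row_ptr has n_rows + 1 entries; col_idx and
    // values hold the nonzeros row by row.
    struct CsrMatrix {
        std::size_t n_rows;
        std::vector<std::size_t> row_ptr;
        std::vector<std::size_t> col_idx;
        std::vector<double> values;
    };

    // y = A * x. Note the indirect read x[A.col_idx[k]]: this is the
    // irregular memory access discussed in the answer below.
    std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
        std::vector<double> y(A.n_rows, 0.0);
        for (std::size_t i = 0; i < A.n_rows; ++i) {
            double sum = 0.0;
            for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
                sum += A.values[k] * x[A.col_idx[k]];
            }
            y[i] = sum;
        }
        return y;
    }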

Please note that I have already viewed the following posts: Q1 Q2 Q3. However, I am still wondering about the following:

  1. How does SPMv multiplication compare to DMv in terms of time? Since the sparsity is high, I would assume SPMv should be better given the reduced number of operations. Is that right?
  2. What factors should I consider so that SPMv matches or beats DMv in time? Why do people say DMv performs better than SPMv? Does the storage format make a difference?
  3. Any suggestions for a library that performs SPMv in C++, for either CPU or GPU implementations.

This question is relevant to my other question here: (CSCC: Convolution Split Compression Calculation Algorithm for Deep Neural Network)

Recommended answer

Answering the edited question:

  1. Unless the matrix is very sparse (<10% nonzeros on CPU, probably <1% on GPU), you will likely not benefit from the sparsity. While the number of floating point operations is reduced, the amount of storage is at least double (column or row index + value), memory access is irregular (you have an indirection via the index for the right-hand side), it becomes far more difficult to vectorize (or to achieve coalescing on the GPU), and if you parallelize you have to deal with the fact that rows are of varying length, so a static schedule is likely to be suboptimal.
  2. Beyond the points above, yes, the storage representation matters. For example, a COO matrix stores two indices and the value per nonzero, while CSR/CSC store only one index but require an additional offset array, which makes them more complex to build on the fly. Especially on the GPU, the storage format matters if you want to achieve at least some coalescing. This paper looks into how storage formats affect performance on the GPU: https://onlinelibrary.wiley.com/doi/full/10.1111/cgf.13957
  3. For something generic, try Eigen on CPU or cuSPARSE on GPU (see the sketch after this list). There are plenty of others that perform better for specific use cases, but this part of the question isn't clearly answerable.
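
As a concrete starting point, the Eigen route looks roughly like the sketch below. The tridiagonal fill pattern and the problem size are placeholders; note that Eigen's SparseMatrix is compressed column-major (CSC) by default:

    #include <Eigen/Dense>
    #include <Eigen/Sparse>
    #include <vector>

    int main() {
        const int n = 1000;

        // Build A from (row, col, value) triplets; setFromTriplets
        // compresses them into Eigen's CSC representation.
        std::vector<Eigen::Triplet<double>> triplets;
        for (int i = 0; i < n; ++i) {
            triplets.emplace_back(i, i, 2.0);                     // diagonal
            if (i + 1 < n) triplets.emplace_back(i, i + 1, -1.0); // off-diagonal
        }
        Eigen::SparseMatrix<double> A(n, n);
        A.setFromTriplets(triplets.begin(), triplets.end());

        Eigen::VectorXd x = Eigen::VectorXd::Random(n);
        Eigen::VectorXd y = A * x;  // sparse-dense SpMV
        return y.size() == n ? 0 : 1;
    }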

Beyond the matrix format itself, even the ordering of entries in your matrix can have a massive impact on performance, which is why the Cuthill-McKee algorithm is often used to reduce matrix bandwidth (and thereby improve cache performance).
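
To make the idea concrete, below is a minimal reverse Cuthill-McKee sketch over an adjacency-list view of the sparsity pattern. It assumes a symmetric pattern and a single connected component, and picks a simple minimum-degree starting vertex; production implementations (e.g. in Boost or SciPy) handle the general case:

    #include <algorithm>
    #include <queue>
    #include <vector>

    // Reverse Cuthill-McKee on a symmetric sparsity pattern given as an
    // adjacency list. Returns a permutation: perm[new_index] = old_index.
    std::vector<int> reverse_cuthill_mckee(const std::vector<std::vector<int>>& adj) {
        const int n = static_cast<int>(adj.size());
        // Start from a vertex of minimum degree (a common heuristic).
        int start = 0;
        for (int v = 1; v < n; ++v)
            if (adj[v].size() < adj[start].size()) start = v;

        std::vector<int> order;
        std::vector<bool> visited(n, false);
        std::queue<int> q;
        q.push(start);
        visited[start] = true;
        while (!q.empty()) {
            int v = q.front();
            q.pop();
            order.push_back(v);
            // Enqueue unvisited neighbors in order of increasing degree.
            std::vector<int> nbrs;
            for (int w : adj[v])
                if (!visited[w]) { visited[w] = true; nbrs.push_back(w); }
            std::sort(nbrs.begin(), nbrs.end(), [&](int a, int b) {
                return adj[a].size() < adj[b].size();
            });
            for (int w : nbrs) q.push(w);
        }
        std::reverse(order.begin(), order.end());  // the "reverse" in RCM
        return order;
    }

Renumbering rows and columns with the returned permutation clusters the nonzeros near the diagonal, so successive rows of the SpMV touch nearby entries of x and cache lines are reused more often.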
