Slow sparse matrix-vector product (CSR) using OpenMP


Problem description

I am trying to speed up a sparse matrix-vector product using OpenMP; the code is as follows:

/* z = A*x for a sparse matrix A stored in CSR format
 * (data = non-zero values, colind = column indices, row_ptr = row offsets) */
void zAx(double * z, double * data, long * colind, long * row_ptr, double * x, int M){

  long i, j, ckey;
  int chunk = 1000;
  //int * counts[8]={0};
  #pragma omp parallel num_threads(8)
  {
    #pragma omp for private(ckey,j,i) schedule(static,chunk)
    for (i=0; i<M; i++ ){
      z[i]=0;                       /* each thread owns a contiguous block of rows of z */
      for (ckey=row_ptr[i]; ckey<row_ptr[i+1]; ckey++) {
        j = colind[ckey];
        z[i] += data[ckey]*x[j];
      }
    }
  }
}

Now, this code runs fine and produces the correct result; however, it only gives me a speed-up of ~30%. I have checked that the threads all get about the same number of non-zero elements (they do), and the matrix is fairly large (300,000 x 300,000), so I'm hoping the overhead isn't the only issue. I've also tried running with different chunk sizes and thread counts, and I get similar performance.

Is there something else I could try to get a bit of extra speed out of this? Or am I obviously doing something wrong?

Cheers.

Edit: I just commented out '//int * counts[8]={0}' because it was leftover from counting the work allocation. It is not needed.

Edit 2 (more details):

Okay, so I timed a loop calling this 5000 times and got the average times:


  • Serial: 0.0036 (seconds?)
  • 2 threads: 0.002613
  • 4 threads: 0.002308
  • 8 threads: 0.002384
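For reference, here is a minimal sketch of how such an average might be measured with omp_get_wtime(). The actual harness was not shown in the question, so the 5000-call loop below is only an assumed reconstruction, meant to sit inside main() after the matrix has been set up:

#include <omp.h>
#include <stdio.h>

/* inside main(), with z, data, colind, row_ptr, x and M already set up */
double t0 = omp_get_wtime();
for (int rep = 0; rep < 5000; rep++)
    zAx(z, data, colind, row_ptr, x, M);
double t1 = omp_get_wtime();
printf("average time per call: %g s\n", (t1 - t0) / 5000.0);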

The matrix is of size 303544 x 303544 and has 2122980 non-zero elements.

With a much smaller matrix, 30000 x 30000, I get times more like:


  • Serial: 0.000303
  • 8 threads: 0.000078

So it seems the large size may be my issue.

Recommended answer

Welcome to the wonderful world of memory-bound problems. To relieve you of your pain, I would like to inform you that sparse matrix-vector multiplication is one of the many things that cannot be effectively parallelised, or even vectorised, on a single multi-core chip, unless all of the data fits in the last-level cache or the memory bus is really, really wide.

Why? Simply because the ratio of computation to memory access is extremely low. For each iteration of the inner loop you fetch the column index into j once (8 bytes), the matrix element from data (8 bytes), the value of the vector element (8 bytes), and the previous value of the result (8 bytes, since compilers rarely optimise access to shared variables). Then you perform 2 very fast floating-point operations (FLOPs) and a store (although the += operator is translated to a single instruction, it is still a read-modify-write one). In total you load 32 bytes and perform 2 FLOPs on them. That makes 1/16 FLOPs per byte.

A modern SSE-capable CPU core can perform 4 double-precision FLOPs/cycle, which often works out to something like 8 GFLOPS per CPU core (assuming a base frequency of 2 GHz). With AVX the number is doubled, so you get up to 16 GFLOPS per core on a 2 GHz Intel Sandy/Ivy Bridge or an AMD equivalent. In order to saturate this processing power with data, given 1/16 FLOPs/byte, you would need at least 128 GiB/s of memory bandwidth.
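As a sanity check, the figures above can be put into a tiny back-of-the-envelope calculation. This is only a sketch; the 32 bytes and 2 FLOPs per inner-loop iteration are the numbers derived above, and the 8 GFLOPS peak is the SSE example rate:

#include <stdio.h>

/* Back-of-the-envelope roofline estimate for the CSR kernel above:
 * each inner iteration loads 32 bytes and performs 2 FLOPs.        */
int main(void)
{
    const double bytes_per_iter = 32.0;   /* colind + data + x[j] + z[i]   */
    const double flops_per_iter = 2.0;    /* one multiply + one add        */
    const double peak_gflops    = 8.0;    /* 4 DP FLOPs/cycle at 2 GHz     */

    double intensity   = flops_per_iter / bytes_per_iter;  /* FLOPs per byte          */
    double required_bw = peak_gflops / intensity;          /* GB/s to keep a core fed */

    printf("arithmetic intensity : %.4f FLOPs/byte\n", intensity);
    printf("bandwidth needed     : %.0f GB/s per core\n", required_bw);
    return 0;
}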

A high-end Nehalem-EX processor such as the Xeon X7560 runs at 2.26 GHz (9.04 GFLOPS/core) and its shared L3 cache (the L1 and L2 caches are per-core) delivers about 275 GiB/s. At 9.04 GFLOPS/core, you need 144.64 GiB/s per core in order to feed the inner loop of your zAx routine. This means that, in the ideal case, the L3 cache of this CPU cannot feed more than 2 fully vectorised multiplication kernels.

Without SSE vectorisation the double-precision FLOP rate is half as high, so one could expect the problem to scale up to 4 threads. Things get extremely bad once your problem becomes larger than the L3 cache, as the memory bus delivers about ten times less bandwidth.

Try the following version of the inner loop, just to see if the compiler is smart enough to follow the relaxed memory view of OpenMP:

#pragma omp for private(ckey,j) schedule(static,chunk)
for (i=0; i<M; i++){
  double zi = 0.0;
  for (ckey=row_ptr[i]; ckey<row_ptr[i+1]; ckey++) {
    j = colind[ckey];
    zi += data[ckey]*x[j];
  }
  z[i] = zi;
}
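If the compiler still reloads the accumulator, another thing to experiment with is asking for explicit vectorisation of the per-row dot product. This is only a hedged sketch: it assumes a compiler with OpenMP 4.0 simd support, and because x[colind[ckey]] is an indirect (gather) access it is unlikely to change the memory-bound picture:

#pragma omp for private(ckey) schedule(static,chunk)
for (i = 0; i < M; i++) {
  double zi = 0.0;
  /* reduction over the local accumulator; the simd hint (OpenMP 4.0+)
     encourages the compiler to vectorise the dot product of row i with x */
  #pragma omp simd reduction(+:zi)
  for (ckey = row_ptr[i]; ckey < row_ptr[i+1]; ckey++)
    zi += data[ckey] * x[colind[ckey]];
  z[i] = zi;
}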

Unfortunately, there is not much more you can do. Sparse matrix-vector multiplication scales with the number of CPU sockets, not with the number of CPU cores. You would need a multi-socket system with separate memory controllers, e.g. any system with more than one (post-)Nehalem or AMD64 processor.
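On such a multi-socket (NUMA) machine, one additional step is commonly recommended, though it is not part of the answer above and is offered here only as a hedged suggestion: touch the CSR arrays in parallel, with the same static schedule that zAx() uses, so that first-touch page placement puts each block of rows on the socket whose threads will later traverse it. A minimal sketch (row_ptr is assumed to be already built, and the arrays are filled with the actual matrix afterwards, ideally by the same threads):

/* First-touch placement sketch for NUMA systems (hypothetical helper,
 * relies on the OS placing pages on the node of the touching thread). */
void first_touch(double *z, double *data, long *colind,
                 const long *row_ptr, int M, int chunk)
{
  #pragma omp parallel for schedule(static,chunk)
  for (long i = 0; i < M; i++) {
    z[i] = 0.0;
    for (long k = row_ptr[i]; k < row_ptr[i+1]; k++) {
      data[k]   = 0.0;     /* touched by the thread that will later read it */
      colind[k] = 0;
    }
  }
}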

Edit: Optimisation hint: do you really need long to store the column indices and the row pointers? With 2122980 non-zero elements you would be pretty fine using int instead. It would save loading 4 bytes per element in the inner loop and another 4 bytes per row in the outer loop.
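For illustration, a minimal sketch of what the kernel might look like with 32-bit indices (a hypothetical zAx_int variant, not a drop-in replacement, since the CSR arrays themselves would have to be built with int indices):

/* Same kernel with 32-bit indices; halves the index traffic per non-zero.
 * Assumes the CSR structure was built with int colind/row_ptr.            */
void zAx_int(double *z, const double *data, const int *colind,
             const int *row_ptr, const double *x, int M)
{
  int chunk = 1000;
  #pragma omp parallel for schedule(static,chunk)
  for (int i = 0; i < M; i++) {
    double zi = 0.0;
    for (int ckey = row_ptr[i]; ckey < row_ptr[i+1]; ckey++)
      zi += data[ckey] * x[colind[ckey]];
    z[i] = zi;
  }
}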

