Matrix multiplication, KIJ order, Parallel version slower than non-parallel


Question


I have a school task about parallel programming and I'm having a lot of problems with it. My task is to create a parallel version of the given matrix multiplication code and test its performance (and yes, it has to be in KIJ order):

void multiply_matrices_KIJ()
{
    for (int k = 0; k < SIZE; k++)
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}

This is what I came up with so far:

void multiply_matrices_KIJ()
{
    for (int k = 0; k < SIZE; k++)
#pragma omp parallel
    {
#pragma omp for schedule(static, 16)
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}

And that's where I found something confusing: this parallel version of the code runs around 50% slower than the non-parallel one. The difference in speed varies only a little with matrix size (so far I've tested SIZE = 128, 256, 512, 1024, 2048, and various schedule clauses: dynamic, static, none at all, etc.).

Can someone help me understand what I am doing wrong? Is it maybe because I'm using the KIJ order and it won't get any faster with OpenMP?

EDIT:

I'm working on a Windows 7 PC, using Visual Studio 2015 Community edition, compiling in Release x86 mode (x64 doesn't help either). My CPU is an Intel Core i5-2520M @ 2.50GHz (yes, it's a laptop, but I'm getting the same results on my home i7 PC).

I'm using global arrays:

float matrix_a[SIZE][SIZE];    
float matrix_b[SIZE][SIZE];    
float matrix_r[SIZE][SIZE];

I'm assigning random (float) values to matrix a and b, matrix r is filled with 0s.

I've tested the code with various matrix sizes so far (128, 256, 512, 1024, 2048, etc.). Some of them are intentionally too large to fit in cache. My current version of the code looks like this:

void multiply_matrices_KIJ()
{
#pragma omp parallel 
    {
    for (int k = 0; k < SIZE; k++) {
#pragma omp for schedule(dynamic, 16) nowait
        for (int i = 0; i < SIZE; i++) {
            for (int j = 0; j < SIZE; j++) {
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
            }
        }
    }
    }
}

And just to be clear, I know that with a different ordering of the loops I could get better results, but that is the thing - I HAVE TO use the KIJ order. My task is to run the KIJ loops in parallel and check the performance increase. My problem is that I expect(ed) at least somewhat faster execution (than what I'm getting now, which is between 5-10% faster at most), even though it's the I loop that is parallelized (I can't parallelize the K loop, because concurrent updates to the same matrix_r[i][j] would give incorrect results - see the sketch below).
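To illustrate why the K loop is off the table: a rough, untested sketch of a K-parallel version (not code I actually benchmark) would have to serialize every read-modify-write on matrix_r, for example with an atomic update:

void multiply_matrices_KIJ_parallel_k_sketch()
{
#pragma omp parallel for schedule(static)
    for (int k = 0; k < SIZE; k++)
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
            {
                // Threads working on different k hit the same r[i][j],
                // so every update would need to be atomic.
#pragma omp atomic
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
            }
}

Even apart from scheduling overhead, SIZE^3 atomic updates would be far slower than the serial loop, which is why the I loop is the one I parallelize.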

These are the results I'm getting with the code shown above (I run the calculation hundreds of times and take the average time):

SIZE = 128

  • Serial version: 0.000608s
  • Parallel I, schedule(dynamic, 16): 0.000683s
  • Parallel I, schedule(static, 16): 0.000647s
  • Parallel J, no schedule: 0.001978s (this is where I expected much slower execution)

SIZE = 256

  • Serial version: 0.005787s
  • Parallel I, schedule(dynamic, 16): 0.005125s
  • Parallel I, schedule(static, 16): 0.004938s
  • Parallel J, no schedule: 0.013916s

SIZE = 1024

  • Serial version: 0.930250s
  • Parallel I, schedule(dynamic, 16): 0.865750s
  • Parallel I, schedule(static, 16): 0.823750s
  • Parallel J, no schedule: 1.137000s

Solution

Note: This answer is not about how to get the best performance out of your loop order or how to parallelize it, because I consider that order suboptimal for several reasons. Instead, I'll try to give some advice on how to improve the order (and parallelize it).

Loop order

OpenMP is usually used to distribute work over several CPUs. Therefore, you want to maximize the workload of each thread while minimizing the amount of required data and information transfer.

  1. You want to execute the outermost loop in parallel instead of the second one. Therefore, you'll want to have one of the r_matrix indices as the outer loop index in order to avoid race conditions when writing to the result matrix.

  2. The next thing is that you want to traverse the matrices in memory storage order (with the faster-changing index as the second, not the first, subscript).

You can achieve both with the following loop/index order:

for i = 0 to a_rows
  for k = 0 to a_cols
    for j = 0 to b_cols
      r[i][j] += a[i][k]*b[k][j]

Where

  • j changes faster than i or k, and k changes faster than i.
  • i is a result matrix subscript and the i loop can run in parallel

Rearranging your multiply_matrices_KIJ in that way gives quite a bit of a performance boost already.

I did some short tests and the code I used to compare the timings is:

template<class T>
void mm_kij(T const * const matrix_a, std::size_t const a_rows, 
  std::size_t const a_cols, T const * const matrix_b, std::size_t const b_rows, 
  std::size_t const b_cols, T * const matrix_r)
{
  for (std::size_t k = 0; k < a_cols; k++)
  {
    for (std::size_t i = 0; i < a_rows; i++)
    {
      for (std::size_t j = 0; j < b_cols; j++)
      {
        matrix_r[i*b_cols + j] += 
          matrix_a[i*a_cols + k] * matrix_b[k*b_cols + j];
      }
    }
  }
}

mimicking your multiply_matrices_KIJ() function, versus

template<class T>
void mm_opt(T const * const a_matrix, std::size_t const a_rows, 
  std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows, 
  std::size_t const b_cols, T * const r_matrix)
{
  for (std::size_t i = 0; i < a_rows; ++i)
  { 
    T * const r_row_p = r_matrix + i*b_cols;
    for (std::size_t k = 0; k < a_cols; ++k)
    { 
      auto const a_val = a_matrix[i*a_cols + k];
      T const * const b_row_p = b_matrix + k * b_cols;
      for (std::size_t j = 0; j < b_cols; ++j)
      { 
        r_row_p[j] += a_val * b_row_p[j];
      }
    }
  }
}

implementing the above-mentioned order.

Time consumption for multiplication of two 2048x2048 matrices on Intel i5-2500k

  • mm_kij(): 6.16706s.

  • mm_opt(): 2.6567s.

The given order also allows for outer loop parallelization without introducing any race conditions when writing to the result matrix:

template<class T>
void mm_opt_par(T const * const a_matrix, std::size_t const a_rows, 
  std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows, 
  std::size_t const b_cols, T * const r_matrix)
{
#if defined(_OPENMP)
  #pragma omp parallel
  {
    auto ar = static_cast<std::ptrdiff_t>(a_rows);
    #pragma omp for schedule(static) nowait
    for (std::ptrdiff_t i = 0; i < ar; ++i)
#else
    for (std::size_t i = 0; i < a_rows; ++i)
#endif
    {
      T * const r_row_p = r_matrix + i*b_cols;
      for (std::size_t k = 0; k < b_rows; ++k)
      {
        auto const a_val = a_matrix[i*a_cols + k];
        T const * const b_row_p = b_matrix + k * b_cols;
        for (std::size_t j = 0; j < b_cols; ++j)
        {
          r_row_p[j] += a_val * b_row_p[j];
        }
      }
    }
#if defined(_OPENMP)
  }
#endif
}

Here each thread writes to an individual result row.

Time consumption for multiplication of two 2048x2048 matrices on Intel i5-2500k (4 OMP threads)

  • mm_kij(): 6.16706s.

  • mm_opt(): 2.6567s.

  • mm_opt_par(): 0.968325s.

Not perfect scaling, but as a starting point it is faster than the serial code.
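For reference, a minimal driver along the following lines can be used to try the functions out. The matrix size, random initialization, and std::chrono timing are illustrative choices of mine, not the exact setup behind the numbers above, and it assumes mm_opt_par from this answer is defined in the same translation unit:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

int main()
{
  std::size_t const n = 1024;                       // arbitrary test size
  std::vector<float> a(n * n), b(n * n), r(n * n, 0.0f);

  // Fill A and B with random values; R starts out as zeros.
  std::mt19937 gen(42);
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);
  for (auto & x : a) x = dist(gen);
  for (auto & x : b) x = dist(gen);

  auto const t0 = std::chrono::steady_clock::now();
  mm_opt_par(a.data(), n, n, b.data(), n, n, r.data());
  auto const t1 = std::chrono::steady_clock::now();

  std::cout << "mm_opt_par: "
            << std::chrono::duration<double>(t1 - t0).count() << " s\n";
}

Remember to enable OpenMP when compiling (e.g. /openmp for MSVC or -fopenmp for GCC/Clang); otherwise the serial #else branch of mm_opt_par is used.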
