加快矩阵乘法 [英] Speed up matrix multiplication

查看:244
本文介绍了加快矩阵乘法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我用这个简单的算法进行矩阵乘法。要我用对象进行动态地包含创建的阵列的基质中更加灵活。

I'm performing matrix multiplication with this simple algorithm. To be more flexible I used objects for the matricies which contain dynamicly created arrays.

此相比解决我的第一个静态数组是慢4倍。我能做些什么,以加快数据访问?我并不想改变的算法。

Comparing this solution to my first one with static arrays it is 4 times slower. What can I do to speed up the data access? I don't want to change the algorithm.

 matrix mult_std(matrix a, matrix b) {
 matrix c(a.dim(), false, false);
 for (int i = 0; i < a.dim(); i++)
  for (int j = 0; j < a.dim(); j++) {
   int sum = 0;
   for (int k = 0; k < a.dim(); k++)
    sum += a(i,k) * b(k,j);
   c(i,j) = sum;
  }

 return c;
}



修改


我纠正我的问题avove 我加满源$ C ​​$ C以下,尝试了一些你的建议的:


EDIT
I corrected my Question avove! I added the full source code below and tried some of your advices:


  • K Ĵ循环迭代 - >性能改进

  • 声明暗淡()运算符()()在线 - >性能改进

  • 传递参数以const参考! - > 性能损失为什么?所以我不使用它。

  • swapped k and j loop iterations -> performance improvement
  • declared dim() and operator()() as inline -> performance improvement
  • passing arguments by const reference -> performance loss! why? so I don't use it.

性能是现在的近,因为它是在老porgram相同。也许应该有更多的改进。

The performance is now nearly the same as it was in the old porgram. Maybe there should be a bit more improvement.

但我有一个问题:我在函数得到一个内存错误 mult_strassen(...)。为什么?结果
终止扔'则为std :: bad_alloc的一个实例后调用结果
什么()则为std :: bad_alloc



旧程序的结果
main.c中 http://pastebin.com/qPgDWGpW

But I have another problem: I get a memory error in the function mult_strassen(...). Why?
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc


OLD PROGRAM
main.c http://pastebin.com/qPgDWGpW

C99的main.c -o矩阵-O3



新方案的结果
matrix.h http://pastebin.com/TYFYCTY7 结果
matrix.cpp http://pastebin.com/wYADLJ8Y 结果
http://pastebin.com/48BSqGJr 结果

G ++的main.cpp m​​atrix.cpp -o矩阵-O3



修改


这里有一些结果。标准算法(STD)的比较,交换J和K环(交换)的秩序,堵塞algortihm块尺寸是13(块)。


EDIT
Here are some results. Comparison between standard algorithm (std), swapped order of j and k loop (swap) and blocked algortihm with block size 13 (block).

推荐答案

路过常量引用参数入手:

Pass the parameters by const reference to start with:

matrix mult_std(matrix const& a, matrix const& b) {

要给你更多的细节,我们需要知道所使用的其他方法的细节。结果
并回答了为什么原来的方法快4倍,我们需要看到原来的方法。

To give you more details we need to know the details of the other methods used.
And to answer why the original method is 4 times faster we would need to see the original method.

问题无疑是你为这个问题之前已经无数次得到解决。

The problem is undoubtedly yours as this problem has been solved a million times before.

也问这样的问题始终提供编译适当的输入源的时候,所以我们实际上可以生成并运行code ,看看发生了什么。

Also when asking this type of question ALWAYS provide compilable source with appropriate inputs so we can actually build and run the code and see what is happening.

如果没有code,我们只是猜测。

Without the code we are just guessing.

在原来的C code固定的主要错误后(一个缓冲器超限运行)

After fixing the main bug in the original C code (a buffer over-run)

我有更新code在一个公平的比较运行并排测试端:

I have update the code to run the test side by side in a fair comparison:

 // INCLUDES -------------------------------------------------------------------
 #include <stdlib.h>
 #include <stdio.h>
 #include <sys/time.h>
 #include <time.h>

 // DEFINES -------------------------------------------------------------------
 // The original problem was here. The MAXDIM was 500. But we were using arrays
 // that had a size of 512 in each dimension. This caused a buffer overrun that
 // the dim variable and caused it to be reset to 0. The result of this was causing
 // the multiplication loop to fall out before it had finished (as the loop was
 // controlled by this global variable.
 //
 // Everything now uses the MAXDIM variable directly.
 // This of course gives the C code an advantage as the compiler can optimize the
 // loop explicitly for the fixed size arrays and thus unroll loops more efficiently.
 #define MAXDIM 512
 #define RUNS 10

 // MATRIX FUNCTIONS ----------------------------------------------------------
 class matrix
 {
 public:
 matrix(int dim)
       : dim_(dim)
 {
         data_ = new int[dim_ * dim_];

 }

     inline int dim() const {
                         return dim_;
                 }
                 inline int& operator()(unsigned row, unsigned col) {
                         return data_[dim_*row + col];
                 }

                 inline int operator()(unsigned row, unsigned col) const {
                         return data_[dim_*row + col];
                 }

 private:
     int dim_;
     int* data_;
 };

// ---------------------------------------------------
 void random_matrix(int (&matrix)[MAXDIM][MAXDIM]) {
         for (int r = 0; r < MAXDIM; r++)
                 for (int c = 0; c < MAXDIM; c++)
                         matrix[r][c] = rand() % 100;
 }
 void random_matrix_class(matrix& matrix) {
         for (int r = 0; r < matrix.dim(); r++)
                 for (int c = 0; c < matrix.dim(); c++)
                         matrix(r, c) = rand() % 100;
 }

 template<typename T, typename M>
 float run(T f, M const& a, M const& b, M& c)
 {
         float time = 0;

         for (int i = 0; i < RUNS; i++) {
                 struct timeval start, end;
                 gettimeofday(&start, NULL);
                 f(a,b,c);
                 gettimeofday(&end, NULL);

                 long s = start.tv_sec * 1000 + start.tv_usec / 1000;
                 long e = end.tv_sec * 1000 + end.tv_usec / 1000;

                 time += e - s;
         }
         return time / RUNS;
 }
 // SEQ MULTIPLICATION ----------------------------------------------------------
  int* mult_seq(int const(&a)[MAXDIM][MAXDIM], int const(&b)[MAXDIM][MAXDIM], int (&z)[MAXDIM][MAXDIM]) {
          for (int r = 0; r < MAXDIM; r++) {
                  for (int c = 0; c < MAXDIM; c++) {
                          z[r][c] = 0;
                          for (int i = 0; i < MAXDIM; i++)
                                  z[r][c] += a[r][i] * b[i][c];
                  }
          }
  }
  void mult_std(matrix const& a, matrix const& b, matrix& z) {
          for (int r = 0; r < a.dim(); r++) {
                  for (int c = 0; c < a.dim(); c++) {
                          z(r,c) = 0;
                          for (int i = 0; i < a.dim(); i++)
                                  z(r,c) += a(r,i) * b(i,c);
                  }
          }
  }

  // MAIN ------------------------------------------------------------------------
  using namespace std;
  int main(int argc, char* argv[]) {
          srand(time(NULL));

          int matrix_a[MAXDIM][MAXDIM];
          int matrix_b[MAXDIM][MAXDIM];
          int matrix_c[MAXDIM][MAXDIM];
          random_matrix(matrix_a);
          random_matrix(matrix_b);
          printf("%d ", MAXDIM);
          printf("%f \n", run(mult_seq, matrix_a, matrix_b, matrix_c));

          matrix a(MAXDIM);
          matrix b(MAXDIM);
          matrix c(MAXDIM);
          random_matrix_class(a);
          random_matrix_class(b);
          printf("%d ", MAXDIM);
          printf("%f \n", run(mult_std, a, b, c));

          return 0;
  }

现在的结果:

$ g++ t1.cpp
$ ./a.exe
512 1270.900000
512 3308.800000

$ g++ -O3 t1.cpp
$ ./a.exe
512 284.900000
512 622.000000

从这里我们看到了C code为约两倍的C ++ code时全面优化。我看不出在code中的原因。

From this we see the C code is about twice as fast as the C++ code when fully optimized. I can not see the reason in the code.

这篇关于加快矩阵乘法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆