如何在C ++中加快矩阵乘法? [英] How to speed up matrix multiplication in C++?

查看:97
本文介绍了如何在C ++中加快矩阵乘法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在用这种简单的算法执行矩阵乘法.为了更加灵活,我对包含动态创建的数组的矩阵使用了对象.

I'm performing matrix multiplication with this simple algorithm. To be more flexible I used objects for the matricies which contain dynamicly created arrays.

将此解决方案与第一个使用静态数组的解决方案相比,它的速度慢了4倍.我该怎么做才能加快数据访问速度?我不想更改算法.

Comparing this solution to my first one with static arrays it is 4 times slower. What can I do to speed up the data access? I don't want to change the algorithm.

 matrix mult_std(matrix a, matrix b) {
 matrix c(a.dim(), false, false);
 for (int i = 0; i < a.dim(); i++)
  for (int j = 0; j < a.dim(); j++) {
   int sum = 0;
   for (int k = 0; k < a.dim(); k++)
    sum += a(i,k) * b(k,j);
   c(i,j) = sum;
  }

 return c;
}


编辑
我更正了我的问题!我在下面添加了完整的源代码,并尝试了一些建议:


EDIT
I corrected my Question avove! I added the full source code below and tried some of your advices:

  • 交换了kj循环迭代->性能提高
  • dim()operator()()声明为inline->性能改进
  • 通过const引用传递参数-> 性能损失!为什么?所以我不使用它.
  • swapped k and j loop iterations -> performance improvement
  • declared dim() and operator()() as inline -> performance improvement
  • passing arguments by const reference -> performance loss! why? so I don't use it.

现在的性能几乎与旧的porgram相同.也许应该有更多的改进.

The performance is now nearly the same as it was in the old porgram. Maybe there should be a bit more improvement.

但是我还有另一个问题:函数mult_strassen(...)中出现内存错误.为什么?
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc


旧程序
main.c http://pastebin.com/qPgDWGpW

But I have another problem: I get a memory error in the function mult_strassen(...). Why?
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc


OLD PROGRAM
main.c http://pastebin.com/qPgDWGpW

c99 main.c -o matrix -O3


新计划
matrix.h http://pastebin.com/TYFYCTY7
matrix.cpp http://pastebin.com/wYADLJ8Y
main.cpp http://pastebin.com/48BSqGJr

c99 main.c -o matrix -O3


NEW PROGRAM
matrix.h http://pastebin.com/TYFYCTY7
matrix.cpp http://pastebin.com/wYADLJ8Y
main.cpp http://pastebin.com/48BSqGJr

g++ main.cpp matrix.cpp -o matrix -O3.


编辑
这是一些结果.标准算法(std),j和k循环的交换顺序(交换)和块大小为13(块)的块算法之间的比较.


EDIT
Here are some results. Comparison between standard algorithm (std), swapped order of j and k loop (swap) and blocked algortihm with block size 13 (block).

推荐答案

通过const引用传递参数以开始于:

Pass the parameters by const reference to start with:

matrix mult_std(matrix const& a, matrix const& b) {

要向您提供更多详细信息,我们需要了解所用其他方法的详细信息.
为了回答为什么原始方法要快4倍的问题,我们需要查看原始方法.

To give you more details we need to know the details of the other methods used.
And to answer why the original method is 4 times faster we would need to see the original method.

这个问题无疑是你的,因为这个问题已经被解决了一百万遍了.

The problem is undoubtedly yours as this problem has been solved a million times before.

另外,当问此类问题时,总是向编译源提供适当的输入,以便我们实际上可以构建并运行代码并查看正在发生的事情.

Also when asking this type of question ALWAYS provide compilable source with appropriate inputs so we can actually build and run the code and see what is happening.

没有代码,我们只是在猜测.

Without the code we are just guessing.

修复原始C代码中的主要错误(缓冲区溢出)

After fixing the main bug in the original C code (a buffer over-run)

我已经更新了代码以进行公平的比较并排运行测试:

I have update the code to run the test side by side in a fair comparison:

 // INCLUDES -------------------------------------------------------------------
 #include <stdlib.h>
 #include <stdio.h>
 #include <sys/time.h>
 #include <time.h>

 // DEFINES -------------------------------------------------------------------
 // The original problem was here. The MAXDIM was 500. But we were using arrays
 // that had a size of 512 in each dimension. This caused a buffer overrun that
 // the dim variable and caused it to be reset to 0. The result of this was causing
 // the multiplication loop to fall out before it had finished (as the loop was
 // controlled by this global variable.
 //
 // Everything now uses the MAXDIM variable directly.
 // This of course gives the C code an advantage as the compiler can optimize the
 // loop explicitly for the fixed size arrays and thus unroll loops more efficiently.
 #define MAXDIM 512
 #define RUNS 10

 // MATRIX FUNCTIONS ----------------------------------------------------------
 class matrix
 {
 public:
 matrix(int dim)
       : dim_(dim)
 {
         data_ = new int[dim_ * dim_];

 }

     inline int dim() const {
                         return dim_;
                 }
                 inline int& operator()(unsigned row, unsigned col) {
                         return data_[dim_*row + col];
                 }

                 inline int operator()(unsigned row, unsigned col) const {
                         return data_[dim_*row + col];
                 }

 private:
     int dim_;
     int* data_;
 };

// ---------------------------------------------------
 void random_matrix(int (&matrix)[MAXDIM][MAXDIM]) {
         for (int r = 0; r < MAXDIM; r++)
                 for (int c = 0; c < MAXDIM; c++)
                         matrix[r][c] = rand() % 100;
 }
 void random_matrix_class(matrix& matrix) {
         for (int r = 0; r < matrix.dim(); r++)
                 for (int c = 0; c < matrix.dim(); c++)
                         matrix(r, c) = rand() % 100;
 }

 template<typename T, typename M>
 float run(T f, M const& a, M const& b, M& c)
 {
         float time = 0;

         for (int i = 0; i < RUNS; i++) {
                 struct timeval start, end;
                 gettimeofday(&start, NULL);
                 f(a,b,c);
                 gettimeofday(&end, NULL);

                 long s = start.tv_sec * 1000 + start.tv_usec / 1000;
                 long e = end.tv_sec * 1000 + end.tv_usec / 1000;

                 time += e - s;
         }
         return time / RUNS;
 }
 // SEQ MULTIPLICATION ----------------------------------------------------------
  int* mult_seq(int const(&a)[MAXDIM][MAXDIM], int const(&b)[MAXDIM][MAXDIM], int (&z)[MAXDIM][MAXDIM]) {
          for (int r = 0; r < MAXDIM; r++) {
                  for (int c = 0; c < MAXDIM; c++) {
                          z[r][c] = 0;
                          for (int i = 0; i < MAXDIM; i++)
                                  z[r][c] += a[r][i] * b[i][c];
                  }
          }
  }
  void mult_std(matrix const& a, matrix const& b, matrix& z) {
          for (int r = 0; r < a.dim(); r++) {
                  for (int c = 0; c < a.dim(); c++) {
                          z(r,c) = 0;
                          for (int i = 0; i < a.dim(); i++)
                                  z(r,c) += a(r,i) * b(i,c);
                  }
          }
  }

  // MAIN ------------------------------------------------------------------------
  using namespace std;
  int main(int argc, char* argv[]) {
          srand(time(NULL));

          int matrix_a[MAXDIM][MAXDIM];
          int matrix_b[MAXDIM][MAXDIM];
          int matrix_c[MAXDIM][MAXDIM];
          random_matrix(matrix_a);
          random_matrix(matrix_b);
          printf("%d ", MAXDIM);
          printf("%f \n", run(mult_seq, matrix_a, matrix_b, matrix_c));

          matrix a(MAXDIM);
          matrix b(MAXDIM);
          matrix c(MAXDIM);
          random_matrix_class(a);
          random_matrix_class(b);
          printf("%d ", MAXDIM);
          printf("%f \n", run(mult_std, a, b, c));

          return 0;
  }

现在的结果:

$ g++ t1.cpp
$ ./a.exe
512 1270.900000
512 3308.800000

$ g++ -O3 t1.cpp
$ ./a.exe
512 284.900000
512 622.000000

由此可见,完全优化后的C代码速度大约是C ++代码的两倍.我在代码中看不到原因.

From this we see the C code is about twice as fast as the C++ code when fully optimized. I can not see the reason in the code.

这篇关于如何在C ++中加快矩阵乘法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆