Performance issue: matrix multiplication in C and C++


Question




Hi,

I'm currently working on a matrix multiplication code (matrix times
matrix), and have come across some interesting/confusing results
concerning the running time of the (apparently) same algorithm
when implemented in C or C++. I noticed that the implementation
in C is faster by a factor of 2.5 compared to an identical(?)
C++ implementation.

The basic algorithm in C is:

/* in function main() */
idxC = 0;
for (i = 0; i < dimension; i++)
    for (k = 0; k < dimension; k++) {
        idxA = i*dimension;
        endA = idxA + dimension;
        idxB = k;
        while (idxA < endA) {
            Celem[idxC] += Aelem[idxA] * Belem[idxB];
            idxA++;
            idxB += dimension;
        }
        idxC++;
    }

where matrices A, B, and C are implemented using a 1-dimensional
array with line-by-line numbering (i.e. Fortran-style).
For example, idxA points to the current element A(i,j) of matrix A.
After idxA++, the index will point to element A(i,j+1).
Similarly, idxA += dimension will move the index to element A(i+1,j).
Thus, the code computes the matrix product C = A*B while trying to
avoid explicit computation of matrix indices.
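The flattened indexing described above can be sketched as a small helper (the function name here is made up for illustration; the original code simply inlines this arithmetic):

```cpp
#include <cassert>

// Map (row, col) to a flat offset for a matrix stored row by row
// in a single 1-D array, as described above. With this mapping,
// idx + 1 is the element one column over, A(i, j+1), and
// idx + dimension is the element one row down, A(i+1, j).
inline int flatIndex(int row, int col, int dimension) {
    return row * dimension + col;
}
```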

The respective algorithm in C++ is:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C.elem[idxC] += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }
}

Here, A, B, and C, are objects of some matrix class that have a
member elem (of type double*).

For the same size of matrices (729x729), the C implementation takes
around 10 seconds on my machine (Pentium III, gcc, version 3.2,
optimization level -O3 and -funroll-loops).
The C++-implementation takes approximately 25 seconds on the same
machine using g++ (same version, same optimization parameters).

Is anybody out there able to give a comment on the reasons for
this huge loss of performance?

Using a static function, and the double-arrays as parameters, i.e.

template<class MatElem>
void FortranMatrix<MatElem>::stamult(const MatElem* A,
                                     const MatElem* B,
                                     MatElem* C, int dimension) {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C[idxC] += A[idxA] * B[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }
}

Now, this is almost identical to the C algorithm, but the code still
takes around 20 seconds. I was actually prepared to see certain
performance losses when using C++ instead of C. However, I find it
confusing to lose a factor of 2 in code that's not even close to
using object-oriented features that might slow it down.
I didn't try it without templates, but that can't make too
much of a difference, can it?

I'd appreciate any comment on this, and I would like to thank you
very much for taking the time to read this rather longish article :-)

Michael

Answers

Michael Bader wrote:
....


Quoting the inner loop of the C++ version above:

    C.elem[idxC] += elem[idxA] * B.elem[idxB];

C.elem[idxC]

You can probably remove this from the loop.

The compiler probably can't optimize the fetch of the "elem" array, so
you'll need to do it yourself.

Try something like:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            MatElem v = 0;
            while (idxA < endA) {
                v += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            C.elem[idxC] = v;
            idxC++;
        }
}

-- Still, you can also fetch the addresses this->elem, B.elem, and C.elem
and place them in temporaries (although the compiler can probably do that
for this->elem and B.elem, since *this and B are const, but not for C).
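A self-contained sketch of that suggestion (a free function with hypothetical names standing in for the member function, so the base pointers are plain locals rather than repeated member loads):

```cpp
#include <cassert>

// Same triple loop as in the post, but with the element arrays
// fetched into local pointers once, so the compiler need not
// re-load them through `this`, `B`, and `C` on every iteration,
// and with the sum accumulated in a local before the single store.
void multHoisted(const double* a, const double* b, double* c, int dimension) {
    const double* pa = a;   // stands in for this->elem
    const double* pb = b;   // stands in for B.elem
    double* pc = c;         // stands in for C.elem
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i * dimension;
            int endA = idxA + dimension;
            int idxB = k;
            double v = 0.0; // accumulate in a register, store once
            while (idxA < endA) {
                v += pa[idxA] * pb[idxB];
                idxA++;
                idxB += dimension;
            }
            pc[idxC] = v;
            idxC++;
        }
}
```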

Which version of gcc are you using?





尝试发布可编辑的代码。最后一个例子确实看起来像

应该至少和C代码一样快。在没有检查实际生成的汇编代码的情况下,很难知道为什么它正在做它正在做的事情。


G



Try posting compilable code. The last example really does look like it
should be at least as fast as the C code. Without examining the actual
generated assembly code, it's hard to know why it's doing what it's doing.

G


"Michael Bader" <ba***@in.tum.de> wrote






最可能的原因是g ++没有生成像gcc那样好的中间代码(实际上,g ++ 3.x生成更大更慢的代码)比g ++ 2.95)。

这很常见。我见过的最糟糕的差异是在90年代早期Watcom的C和

C ++编译器,其中一些代码运行速度慢了10倍或更多

用他们的C ++编译器编译而不是用他们的C编译器编译。这不是C和C ++语言之间的差异,而是两个编译器之间的差异。
实现。 C ++编译器没有C编译器那么长的历史,

和C ++编译器编写者不得不花费添加的时间和支持无数功能的
C没有,C编译器编写者已经能够投入更好的代码生成。 C ++最终将迎头赶上
(今天的C ++编译器与10年前的C编译器相比)。在

的同时,您当然可以在C中编写性能关键代码并将其链接到您的C ++应用程序中。


Claudio Puviani



The most likely reason is that g++ doesn't generate intermediate code that's as
good as gcc's (in fact, g++ 3.x generates bigger and slower code than g++ 2.95).
This is fairly common. The worst difference I've seen was with Watcom's C and
C++ compilers in the early 90s, where some code ran 10 times or more slower when
compiled with their C++ compiler than with their C compiler. This isn't a
difference between the C and C++ languages, but between the two compiler
implementations. C++ compilers just don't have as long a history as C compilers,
and in the time that C++ compiler writers have had to spend adding and
supporting the myriad features that C doesn't have, the C compiler writers have
been able to invest in better code generation. C++ will eventually catch up
(today's C++ compilers compare favorably with C compilers of 10 years ago). In the
meantime, you can certainly write performance-critical code in C and link it
into your C++ application.

Claudio Puviani
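The last point, writing the hot loop in C and linking it into a C++ program, hinges on C language linkage. A minimal sketch (the function and its name are made up for illustration):

```cpp
#include <cassert>

// In a real project this body would sit in a .c file built by the C
// compiler; declaring it extern "C" on the C++ side gives it an
// unmangled, C-linkage name so the two object files link together.
extern "C" double dotStride(const double* a, const double* b,
                            int n, int strideB) {
    double v = 0.0;
    for (int j = 0; j < n; j++)
        v += a[j] * b[j * strideB]; // row of A times strided column of B
    return v;
}
```

The inner while loop of the posted code is exactly such a strided dot product, so it would be the natural unit to move across the language boundary.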


Michael Bader <ba***@in.tum.de> wrote in
news:c2**********@wsc10.lrz-muenchen.de:






显示,不要告诉。 FortranMatrix的实际类声明在哪里

< MatElem>?什么_is_ MatElem?显示它的声明。



Show, don't tell. Where is the actual class declaration of FortranMatrix
<MatElem>? What _is_ MatElem? Show its declaration too.
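For what it's worth, a declaration along the following lines would be enough to make the posted member function compile; this is a guess at the shape of the class, not the poster's actual code:

```cpp
#include <cassert>

// Hypothetical minimal shape of the class the post refers to;
// the poster's real FortranMatrix may well differ.
template<class MatElem>
class FortranMatrix {
public:
    explicit FortranMatrix(int dim)
        : dimension(dim), elem(new MatElem[dim * dim]()) {}
    ~FortranMatrix() { delete[] elem; }

    // The mult() from the thread, with the accumulator hoisted
    // into a local as the first answer suggested.
    void mult(const FortranMatrix& B, FortranMatrix& C) const {
        int idxC = 0;
        for (int i = 0; i < dimension; i++)
            for (int k = 0; k < dimension; k++) {
                int idxA = i * dimension;
                int endA = idxA + dimension;
                int idxB = k;
                MatElem v = 0;
                while (idxA < endA) {
                    v += elem[idxA] * B.elem[idxB];
                    idxA++;
                    idxB += dimension;
                }
                C.elem[idxC++] = v;
            }
    }

    int dimension;
    MatElem* elem;  // the double* member the post mentions

private:
    FortranMatrix(const FortranMatrix&);            // non-copyable
    FortranMatrix& operator=(const FortranMatrix&);
};
```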

