Incomprehensible performance improvement with OpenMP even when num_threads(1)
Problem description
The following lines of code
int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char *buff = (unsigned char *) malloc( numel );
unsigned char *pbuff = buff;
#pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
for (int i=0; i<nrows; i++)
{
for (int j=0; j<ncols; j++)
{
*pbuff += 1;
pbuff++;
}
}
take 11130 usecs to run on my i5-3230M when compiled with
g++ -o main main.cpp -std=c++0x -O3
That is, when the OpenMP pragmas are ignored.
On the other hand, it only takes 1496 usecs when compiled with
g++ -o main main.cpp -std=c++0x -O3 -fopenmp
This is more than 6 times faster, which is quite surprising considering that it runs on a 2-core machine. In fact, I have also tested it with num_threads(1), and the performance improvement is still substantial (more than 3 times faster).
Can anybody help me understand this behaviour?
EDIT: following the suggestions, here is the full piece of code:
#include <stdlib.h>
#include <iostream>
#include <chrono>
#include <cassert>
int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char * buff;
void func()
{
unsigned char *pbuff = buff;
#pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
for (int i=0; i<nrows; i++)
{
for (int j=0; j<ncols; j++)
{
*pbuff += 1;
pbuff++;
}
}
}
int main()
{
// allocation & initialization
buff = (unsigned char *) malloc( numel );
assert(buff != NULL);
for(size_t k=0; k<numel; k++)
buff[k] = 0;
//
std::chrono::high_resolution_clock::time_point begin;
std::chrono::high_resolution_clock::time_point end;
begin = std::chrono::high_resolution_clock::now();
//
for(int k=0; k<100; k++)
func();
//
end = std::chrono::high_resolution_clock::now();
auto usec = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
std::cout << "func average running time: " << usec/100 << " usecs" << std::endl;
return 0;
}
The answer, as it turns out, is that firstprivate(pbuff, nrows, ncols) effectively declares pbuff, nrows and ncols as local variables within the scope of the for loop. That in turn means the compiler can treat nrows and ncols as constants - it cannot make the same assumption about global variables!
Consequently, with -fopenmp you get the huge speedup because the loop no longer reads a global variable on every iteration. (Plus, with a constant ncols value, the compiler can do a bit of loop unrolling.)
By changing
int nrows = 4096;
int ncols = 4096;
to
const int nrows = 4096;
const int ncols = 4096;
or by changing
for (int i=0; i<nrows; i++)
{
for (int j=0; j<ncols; j++)
{
*pbuff += 1;
pbuff++;
}
}
to
int _nrows = nrows;
int _ncols = ncols;
for (int i=0; i<_nrows; i++)
{
for (int j=0; j<_ncols; j++)
{
*pbuff += 1;
pbuff++;
}
}
the anomalous speedup vanishes - the non-OpenMP code is now just as fast as the OpenMP code.
The moral of the story? Avoid accessing mutable global variables inside performance-critical loops.