Incomprehensible performance improvement with OpenMP even when num_threads(1)

Problem description

The following lines of code

int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char *buff = (unsigned char *) malloc( numel );

unsigned char *pbuff = buff;
#pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}

take 11130 usecs to run on my i5-3230M when compiled with

g++ -o main main.cpp -std=c++0x -O3

That is, when the OpenMP pragmas are ignored.

On the other hand, it only takes 1496 usecs when compiled with

g++ -o main main.cpp -std=c++0x -O3 -fopenmp

This is more than 6 times faster, which is quite surprising considering that it runs on a 2-core machine. In fact, I have also tested it with num_threads(1) and the performance improvement is still quite significant (more than 3 times faster).

Can anybody help me understand this behaviour?

EDIT: following the suggestions, I provide the full piece of code:

#include <stdlib.h>
#include <iostream>

#include <chrono>
#include <cassert>


int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char * buff;


void func()
{
    unsigned char *pbuff = buff;
    #pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
    for (int i=0; i<nrows; i++)
    {
        for (int j=0; j<ncols; j++)
        {
            *pbuff += 1;
            pbuff++;
        }
    }
}


int main()
{
    // alloc & initialization
    buff = (unsigned char *) malloc( numel );
    assert(buff != NULL);
    for(size_t k=0; k<numel; k++)
        buff[k] = 0;

    //
    std::chrono::high_resolution_clock::time_point begin;
    std::chrono::high_resolution_clock::time_point end;
    begin = std::chrono::high_resolution_clock::now();      
    //
    for(int k=0; k<100; k++)
        func();
    //
    end = std::chrono::high_resolution_clock::now();
    auto usec = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
    std::cout << "func average running time: " << usec/100 << " usecs" << std::endl;

    return 0;
}

Solution

The answer, as it turns out, is that firstprivate(pbuff, nrows, ncols) effectively declares pbuff, nrows and ncols as local variables within the scope of the for loop. That in turn means the compiler can see nrows and ncols as constants - it cannot make the same assumption about global variables!

Consequently, with -fopenmp, you end up with the huge speedup because you aren't accessing a global variable each iteration. (Plus, with a constant ncols value, the compiler gets to do a bit of loop unrolling).
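
Conceptually, firstprivate plus the usual OpenMP outlining turns the region into something like the following (a simplified sketch for illustration, not the code GCC actually generates):

static void func_outlined(unsigned char *pbuff0, int nrows0, int ncols0)
{
    unsigned char *pbuff = pbuff0;  // private copy of the pointer
    int nrows = nrows0;             // private copy: a plain local
    int ncols = ncols0;             // private copy: a plain local

    for (int i=0; i<nrows; i++)
    {
        for (int j=0; j<ncols; j++)
        {
            // The loop bounds are locals whose addresses never escape,
            // so this byte store provably cannot modify them; the
            // compiler is free to hoist, unroll and vectorize.
            *pbuff += 1;
            pbuff++;
        }
    }
}

With the bounds loop-invariant, the inner loop is a fixed 4096-byte pass that -O3 can vectorize, which accounts for the large single-thread speedup.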

By changing

int nrows = 4096;
int ncols = 4096;

to

const int nrows = 4096;
const int ncols = 4096;

or by changing

for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}

to

int _nrows = nrows;
int _ncols = ncols;
for (int i=0; i<_nrows; i++)
{
    for (int j=0; j<_ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}

the anomalous speedup vanishes - the non-OpenMP code is now just as fast as the OpenMP code.

The moral of the story? Avoid accessing mutable global variables inside performance-critical loops.
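
The reason the compiler has to be so conservative in the original version is C++'s aliasing rules: a store through an unsigned char lvalue may legally modify any object, including the int globals that serve as loop bounds. A minimal sketch of the effect (hypothetical names, for illustration only):

int ncols = 4096;   // mutable global used as a loop bound

void loop(unsigned char *p)
{
    for (int j=0; j<ncols; j++)
    {
        // p[j] is an unsigned char lvalue, which is allowed to alias
        // any object - including ncols itself - so the compiler must
        // re-read ncols from memory on every iteration.
        p[j] += 1;
    }
}

Copying the bounds into locals (or declaring them const) removes that possibility, which is exactly what firstprivate did behind the scenes.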
