May compiler optimizations be inhibited by multi-threading?


Question

It happened to me a few times to parallelize portion of programs with OpenMP just to notice that in the end, despite the good scalability, most of the foreseen speed-up was lost due to the poor performance of the single threaded case (if compared to the serial version).

The usual explanation that appears on the web for this behavior is that the code generated by compilers may be worse in the multi-threaded case. Anyhow I am not able to find anywhere a reference that explains why the assembly may be worse.

So, what I would like to ask the compiler guys out there is:

May compiler optimizations be inhibited by multi-threading? In which cases, and how, can performance be affected?

If it could help narrowing down the question I am mainly interested in high-performance computing.

Disclaimer: As stated in the comments, part of the answers below may become obsolete in the future as they briefly discuss the way in which optimizations are handled by compilers at the time the question was posed.

Answer

I think this answer describes the reason sufficiently, but I'll expand a bit here.

Before that, however, here's gcc 4.8's documentation on -fopenmp:

-fopenmp
Enable handling of OpenMP directives #pragma omp in C/C++ and !$omp in Fortran. When -fopenmp is specified, the compiler generates parallel code according to the OpenMP Application Program Interface v3.0 http://www.openmp.org/. This option implies -pthread, and thus is only supported on targets that have support for -pthread.

Note that it doesn't specify disabling of any features. Indeed, there is no reason for gcc to disable any optimization.

The reason, however, why OpenMP with 1 thread has overhead compared to no OpenMP at all is that the compiler needs to transform the code, adding functions, so that it is ready for the case of n > 1 threads. So let's think of a simple example:

int *b = ...
int *c = ...
int a = 0;

#pragma omp parallel for reduction(+:a)
for (i = 0; i < 100; ++i)
    a += b[i] + c[i];

This code should be converted to something like this:

struct __omp_func1_data
{
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];

    return NULL;
}

...
for (t = 1; t < nthreads; ++t)
    /* create_thread with __omp_func1 function */
/* for master thread, don't create a thread */
struct __omp_func1_data md = {
    .start = /*...*/,
    .end = /*...*/,
    .b = b,
    .c = c
};

__omp_func1(&md);
a += md.a;
for (t = 1; t < nthreads; ++t)
{
    /* join with thread */
    /* add thread_data->a to a */
}

Now if we run this with nthreads==1, the code effectively gets reduced to:

struct __omp_func1_data
{
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];

    return NULL;
}

...
struct __omp_func1_data md = {
    .start = 0,
    .end = 100,
    .b = b,
    .c = c
};

__omp_func1(&md);
a += md.a;
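
For comparison, here is roughly what the plain serial version (the one the single-threaded OpenMP run gets compared against) looks like: it is simply the original loop without the pragma.

int a = 0;

/* Serial version: the whole loop is visible in place, with a
   compile-time trip count and a local accumulator. */
for (i = 0; i < 100; ++i)
    a += b[i] + c[i];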

So what are the differences between the no openmp version and the single threaded openmp version?

One difference is that there is extra glue code. The variables that need to be passed to the function created by OpenMP need to be put together to form one argument, so there is some overhead preparing for the function call (and later retrieving the data).

More importantly, however, the code is no longer in one piece. Cross-function optimization is not that advanced yet, and most optimizations are done within each function. Smaller functions mean less opportunity to optimize.
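
As a concrete illustration of what this costs: in the outlined function the loop bounds and the pointers arrive through a struct, and the accumulator is a struct field, so unless the optimizer can prove that d->a does not alias d->b or d->c, it may keep reloading and re-storing the accumulator instead of holding it in a register. A hand-rewritten version of the outlined body, the way one would hope the optimizer eventually sees it, might look like this (a sketch only; the function name, the restrict qualifiers and the local accumulator are illustrative assumptions, not something OpenMP adds for you):

void *__omp_func1_hand_opt(void *data)
{
    struct __omp_func1_data *d = data;

    /* Pull everything out of the struct so the accumulator can live in a
       register and the bounds/pointers become plain local values again. */
    int * restrict b = d->b;
    int * restrict c = d->c;
    int start = d->start;
    int end   = d->end;
    int a = 0;
    int i;

    for (i = start; i < end; ++i)
        a += b[i] + c[i];

    d->a = a;   /* write the result back once, at the end */
    return NULL;
}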

To finish this answer, I'd like to show you exactly how -fopenmp affects gcc's options. (Note: I'm on an old computer now, so I have gcc 4.4.3)

Running gcc -Q -v some_file.c gives this (relevant) output:

GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed:  -v a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486
 -fstack-protector
options enabled:  -falign-loops -fargument-alias -fauto-inc-dec
 -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining
 -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident
 -finline-functions-called-once -fira-share-save-slots
 -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore
 -fmath-errno -fmerge-debug-strings -fmove-loop-invariants
 -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec
 -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller
 -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im
 -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
 -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion
 -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model
 -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double
 -maccumulate-outgoing-args -malign-stringops -mfancy-math-387
 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4
 -mpush-args -msahf -mtls-direct-seg-refs

and running gcc -Q -v -fopenmp some_file.c gives this (relevant) output:

GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed:  -v -D_REENTRANT a.c -D_FORTIFY_SOURCE=2 -mtune=generic
 -march=i486 -fopenmp -fstack-protector
options enabled:  -falign-loops -fargument-alias -fauto-inc-dec
 -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining
 -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident
 -finline-functions-called-once -fira-share-save-slots
 -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore
 -fmath-errno -fmerge-debug-strings -fmove-loop-invariants
 -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec
 -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller
 -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im
 -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
 -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion
 -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model
 -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double
 -maccumulate-outgoing-args -malign-stringops -mfancy-math-387
 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4
 -mpush-args -msahf -mtls-direct-seg-refs

Taking a diff, we can see that the only difference is that with -fopenmp, we have -D_REENTRANT defined (and of course -fopenmp enabled). So, rest assured, gcc doesn't produce worse code. It's just that it needs to add preparation code for the case where the number of threads is greater than 1, and that has some overhead.
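
If you want to see this for yourself, one simple check (just a sketch; the file and function names here are made up) is to compile the same translation unit with and without -fopenmp and diff the generated assembly:

/* check.c -- hypothetical test file.
 *
 * Compile twice and compare the assembly output:
 *
 *     gcc -O2 -S -o check_serial.s check.c
 *     gcc -O2 -S -fopenmp -o check_omp.s check.c
 *
 * Without -fopenmp the pragma is simply ignored and you get a plain loop;
 * with -fopenmp the loop body is outlined into a separate function and the
 * thread set-up code is added around it, while the loop itself goes through
 * the same optimization passes in both builds.
 */
int sum(const int *b, const int *c)
{
    int a = 0;
    int i;

#pragma omp parallel for reduction(+:a)
    for (i = 0; i < 100; ++i)
        a += b[i] + c[i];

    return a;
}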

Update: I really should have tested this with optimization enabled. Anyway, with gcc 4.7.3 the output of the same commands with -O3 added shows the same difference. So even with -O3, no optimizations are disabled.
