Reduction with OpenMP: linear merging or log(number of threads) merging

Problem description

I have a general question about reductions with OpenMP that has bothered me for a while. My question concerns merging the partial sums in a reduction: it can be done either linearly or in log(number of threads) steps.

Let's assume I want to do a reduction of some function double foo(int i). With OpenMP I could do it like this.

double sum = 0.0;    
#pragma omp parallel for reduction (+:sum)
for(int i=0; i<n; i++) {
    sum += foo(i);
}

However, I claim that the following code will be just as efficient.

double sum = 0.0;
#pragma omp parallel
{
    double sum_private = 0.0;
    #pragma omp for nowait
    for(int i=0; i<n; i++) {
        sum_private += foo(i);
    }
    #pragma omp critical
    {
        sum += sum_private;
    }
}

Not only will this second code have effectively the same performance, but it's also more general. It can handle any operator I define, whereas the reduction construct only works for some basic operators on plain old data types.
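
To make the "more general" claim concrete, here is a rough sketch of the same critical-section pattern applied to an operation the reduction clause cannot express (at least not at the OpenMP 2.0 level MSVC supports): finding the maximum of foo(i) together with the index where it occurs. The struct, the name argmax_omp, and the sentinel value are arbitrary choices, and the code is untested.

typedef struct { double val; int idx; } maxloc;

maxloc argmax_omp(int n) {            /* assumes foo() from the test code below */
    maxloc best = { -1.0, -1 };       /* safe sentinel: foo(i) >= 0 for i >= 0 */
    #pragma omp parallel
    {
        maxloc my = { -1.0, -1 };
        #pragma omp for nowait
        for(int i=0; i<n; i++) {
            double v = foo(i);
            if(v > my.val) { my.val = v; my.idx = i; }
        }
        #pragma omp critical
        {
            if(my.val > best.val) best = my;   /* user-defined "combine" step */
        }
    }
    return best;
}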

Let's assume there are t threads. The reason I claim this second method is just as fast is that the time to merge the partial sums is negligible compared to the parallel loop. The time to do the partial sums is proportional to n/t, while the time to merge the sums grows as t. So as long as n >> t, or the parallel loop itself is expensive enough (e.g. if foo is slow compared to a single addition), the merging will be negligible. For example, with n = 2^28 and t = 64, each thread does roughly four million additions while the linear merge costs only 64.

I have heard it's possible to merge the partial sums in O(log(t)). However, for all practical purposes I don't see how this will help. The maximum number of physical cores on systems OpenMP typically runs on is on the order of 50; let's assume it's 64. Merging 64 values in 64 linear steps rather than in six binary steps (log2(64) = 6) won't make much difference compared to doing the parallel loop. Additionally, merging the values in some kind of binary tree could have an overhead which is larger than just doing the linear merge, so it's not even necessarily faster.
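
For what it's worth, a log(t)-style merge can also be written by hand in OpenMP, which is the kind of binary-tree merge I mean. The following is only an illustrative, untested sketch (the name reduce_omp_tree and the 256-thread cap are arbitrary): each thread stores its partial sum in a shared array, and pairs of entries are then combined in log2(t) barrier-separated rounds.

double reduce_omp_tree(int n) {
    double partial[256] = {0.0};          /* assumes at most 256 threads; padding against
                                             false sharing is omitted for brevity */
    int nthreads = 1;
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads(); /* implicit barrier at the end of single */

        double s = 0.0;
        #pragma omp for nowait
        for(int i=0; i<n; i++) s += foo(i);
        partial[tid] = s;
        #pragma omp barrier

        /* pairwise merge: log2(nthreads) rounds instead of one linear pass */
        for(int stride = 1; stride < nthreads; stride *= 2) {
            if(tid % (2*stride) == 0 && tid + stride < nthreads)
                partial[tid] += partial[tid + stride];
            #pragma omp barrier
        }
    }
    return partial[0];
}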

When would merging the partial sums in O(log(t)) ever help? When would the first code case ever have a performance advantage over the second code case?

I know some coworkers who merge in O(log(t)) on the GPU with OpenCL (by running the kernel several times for each binary merge), but I have not seen any evidence yet that it's better than just merging linearly.

Edit: Jim Cownie wanted to see an actual test rather than a claim. Below are the results and code. This was done with MSVC2012 in 64-bit release mode on a Xeon E5-1620 (Sandy Bridge) with four physical cores. Both the first and the second case are about 4.45x faster than the version without OpenMP.

Results:

without OpenMP time 1.787158 s
first case     time 0.400462 s
second case    time 0.400456 s

Code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double foo(int i) {
    double fi = i;
    return 1.0*fi/(1+fi*fi);
}

double reduce(int n) {
    double sum = 0.0f;
    for(int i=0; i<n; i++) {
        sum += foo(i);
    }
    return sum;
}

double reduce_omp(int n) {
    double sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for(int i=0; i<n; i++) {
        sum += foo(i);
    }
    return sum;
}

double reduce_omp2(int n) {
    double sum = 0.0f;
    #pragma omp parallel 
    {
        double sum_private = 0.0f;
        #pragma omp for nowait
        for(int i=0; i<n; i++) {
            sum_private += foo(i);
        }
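        /* linear merge: each thread adds its partial sum inside the critical section */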
        #pragma omp critical 
        {
            sum+= sum_private;
        }
    }
    return sum;
}

int main() {
    int n,r;
    double sum, dtime;
    n = 1<<28;
    r = 1;

    dtime = omp_get_wtime();
    for(int i=0; i<r; i++) sum = reduce(n);
    dtime = omp_get_wtime() - dtime;
    printf("time %f, sum %f\n", dtime, sum);

    reduce_omp(n);  //warm omp up

    dtime = omp_get_wtime();
    for(int i=0; i<r; i++) sum = reduce_omp(n);
    dtime = omp_get_wtime() - dtime;
    printf("time %f, sum %f\n", dtime, sum);

    dtime = omp_get_wtime();
    for(int i=0; i<r; i++) sum = reduce_omp2(n);
    dtime = omp_get_wtime() - dtime;
    printf("time %f, sum %f\n", dtime, sum);


    return 0;
}
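
(Note: the test has to be built with OpenMP enabled, e.g. /openmp for MSVC or -fopenmp for GCC, for the pragmas and omp_get_wtime to work.)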

Answer

The OpenMP implementation will make a decision about the best way to do the reduction based on the implementor's knowledge of the specific characteristics of the hardware it's running on. On a system with a small number of CPUs, it will probably do a linear reduction. On a system with hundreds or thousands of cores (e.g. a GPU or Intel Phi) it will likely do a log(n) reduction.

The time spent in the reduction might not matter for very large problems, but for smaller problems it could add a few percent to the total runtime. Your implementation might be just as fast in many cases, but I doubt it would ever be faster, so why not let OpenMP decide on the optimal reduction strategy?
