Parallel for inside a while using OpenMP on C


Problem description

I'm trying to do a parallel for inside a while loop, something like this:

while(!End){
    for(...;...;...) // the parallel for

    ...
    // serial code
}

The for loop is the only part of the while loop that is parallel. If I do it like this, I get a lot of overhead:

cycles = 0;
while(!End){ // ~1e9 iterations approx
    #pragma omp parallel for
    for(i=0;i<N;i++) // the parallel for with ~256 iterations
        if(time[i] == cycles){
           if (wbusy[i]){
               wbusy[i] = 0;
               wfinished[i] = 1;
           }
        }

    // serial code
    ++cycles;
}

Each iteration of the for loop is independent of the others.

There are dependencies between the serial code and the parallel code.

Solution


So normally one doesn't have to worry too much about putting parallel regions into loops: modern OpenMP implementations are pretty efficient about reusing thread teams, and as long as there's lots of work in the loop you're fine. But here, with an outer loop count of ~1e9 and an inner loop count of ~256, and very little work being done per iteration, the overhead is likely comparable to or worse than the amount of work being done, and performance will suffer.

So there will be a noticeable difference between this:

cycles = 0;
while(!End){ // ~1e9 iterations approx
    #pragma omp parallel for
    for(i=0;i<N;i++) // the parallel for with ~256 iterations
        if(time[i] == cycles){
           if (wbusy[i]){
               wbusy[i] = 0;
               wfinished[i] = 1;
           }
        }

    // serial code
    ++cycles;
}

and this:

cycles = 0;
#pragma omp parallel
while(!End){ // ~1e9 iterations approx
    #pragma omp for
    for(i=0;i<N;i++) // the parallel for with ~256 iterations
        if(time[i] == cycles){
           if (wbusy[i]){
               wbusy[i] = 0;
               wfinished[i] = 1;
           }
        }

    // serial code
    #pragma omp single
    {
      ++cycles;
    }
}

But really, that scan across the time array every iteration is unfortunately both (a) slow and (b) not enough work to keep multiple cores busy - it's memory intensive. With more than a couple of threads you will actually have worse performance than serial, even without overheads, just because of memory contention. Admittedly what you have posted here is just an example, not your real code, but why don't you preprocess the time array so you can just check to see when the next task is ready to update:

#include <stdio.h>
#include <stdlib.h>

struct tasktime_t {
    long int time;
    int task;
};

// Compare by time; written as two comparisons rather than a subtraction
// so the result cannot overflow an int for large time differences.
int stime_compare(const void *a, const void *b) {
    long int ta = ((const struct tasktime_t *)a)->time;
    long int tb = ((const struct tasktime_t *)b)->time;
    return (ta > tb) - (ta < tb);
}

int main(int argc, char **argv) {
    const int n=256;
    const long int niters = 100000000l;
    long int time[n];
    int wbusy[n];
    int wfinished[n];

    for (int i=0; i<n; i++) {
        time[i] = rand() % niters;
        wbusy[i] = 1;
        wfinished[i] = 0;
    }

    struct tasktime_t stimes[n];

    for (int i=0; i<n; i++) {
        stimes[i].time = time[i];
        stimes[i].task = i;
    }

    qsort(stimes, n, sizeof(struct tasktime_t), stime_compare);

    long int cycles = 0;
    int next = 0;
    while(cycles < niters){ // 1k Million iterations aprox
        while ( (next < n) && (stimes[next].time == cycles) ) {
           int i = stimes[next].task;
           if (wbusy[i]){
               wbusy[i] = 0;
               wfinished[i] = 1;
           }
           next++;
        }

        ++cycles;
    }

    return 0;
}

This is ~5 times faster than the serial version of the scanning approach (and much faster than the OpenMP versions). Even if you are constantly updating the time/wbusy/wfinished arrays in the serial code, you can keep track of the completion times using a priority queue, with each update taking O(log N) time, instead of scanning every iteration at O(N) per pass.
