OpenMP C program runs slower than sequential code


Problem description

I am a newbie to OpenMP, trying to parallelize Jarvis's algorithm. However, it turns out that the parallel program takes 2-3 times longer compared to the sequential code.

Is it that the problem itself cannot be parallelized? Or is there something wrong in how I parallelized it?

This is my OpenMP program for the problem, with 2 parts being parallelized:

#include <stdio.h>
#include <sys/time.h>
#include <omp.h>

typedef struct Point
{
int x, y;
} Point;

// To find orientation of ordered triplet (p, q, r).
// The function returns
// 0 for collinear points
// 1 as Clockwise
// 2 as Counterclockwise
int orientation(Point p, Point i, Point q)
{
int val = (i.y - p.y) * (q.x - i.x) -
          (i.x - p.x) * (q.y - i.y);
if (val == 0) return 0;  // collinear
return (val > 0)? 1: 2;  // clockwise or counterclockwise
}

// Prints convex hull of a set of n points.
void convexHull(Point points[], int n)
{
// There must be at least 3 points
if (n < 3) return;

// Initialize array to store results
Point results[n];
int count = 0;

// Find the leftmost point
int l = 0,i;

#pragma omp parallel shared (n,l) private (i)
{
    #pragma omp for
    for (i = 1; i < n; i++)
    {
        #pragma omp critical
        {
            if (points[i].x < points[l].x)
            l = i;
        }
    }

}

// Start from leftmost point, keep moving counterclockwise
// until reach the start point again.
int p = l, q;
do
{
    // Add current point to result
    results[count]= points[p];
    count++;

    q = (p+1)%n;
    int k;

    #pragma omp parallel shared (p) private (k)
    {
        #pragma omp for 
        for (k = 0; k < n; k++)
        {
           // If k is more counterclockwise than current q, then
           // update q to k
           #pragma omp critical
           {
            if (orientation(points[p], points[k], points[q]) == 2)
               q = k;
           }
        }       

    }

    // Now q is the most counterclockwise with respect to p
    // Set p as q for next iteration, to add q to result
    p = q;


} while (p != l);  // While algorithm does not return to first point

// Print Result
int j;
for (j = 0; j < count; j++){
  printf("(%d,%d)\n", results[j].x,results[j].y);
}

}

int main()
{
//declaration for start time, end time
//and total executions for the algorithm
struct timeval start, end;
int i, num_run = 100;

gettimeofday(&start,NULL);

Point points[] = {{0, 3}, {2, 2}, {1, 1}, {2, 1},
                    {3, 0}, {0, 0}, {3, 3}};

int n = sizeof(points)/sizeof(points[0]);

convexHull(points, n);

gettimeofday(&end,NULL);

int cpu_time_used = ((end.tv_sec - start.tv_sec) * 1000000) + (end.tv_usec - start.tv_usec);
printf("\n\nExecution time: %d us\n", cpu_time_used);
return 0;
}


I tried to make the input substantial enough by adding in these lines of code:

Point points[3000];   // requires #include <stdlib.h> for rand()
int i;
for(i=0;i<3000;i++) {
    points[i].x = rand()%100;
    points[i].y = rand()%100;
    // Reject a duplicate of an already-generated point and retry:
    // compare only against points[0..i-1], which are initialized
    int j;
    for(j=0;j<i;j++) {
        if(points[i].x==points[j].x && points[i].y==points[j].y) {
            i--;
            break;
        }
    }
}

But it sometimes crashes.

Answer

In the following piece of your code, the whole body of the parallel for loop is wrapped in a critical statement. This means that this part of the code will never be entered by more than one thread at a time. Having multiple threads work one at a time will not go faster than having a single thread go through all iterations, and on top of that, some time is lost in synchronization overhead (each thread must acquire a mutex before entering the critical section and release it afterwards).

int l = 0,i;
#pragma omp parallel shared (n,l) private (i)
{
    #pragma omp for
    for (i = 1; i < n; i++)
    {
        #pragma omp critical
        {
            if (points[i].x < points[l].x)
            l = i;
        }
    }
}

The serial code needs to be somewhat refactored for parallelization. Reduction is often a good approach for simple operations: have each thread compute a partial result on its share of the iterations (e.g. a partial minimum or a partial sum), then merge all partial results into a global one. For supported operations, the #pragma omp for reduction(op:var) syntax can be used, but in this case the reduction has to be done manually.
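For a supported operation, the clause looks like this (a minimal sketch with made-up names, shown only to illustrate the syntax; it is not part of the convex hull code):

#include <omp.h>

// Each thread accumulates into its own private copy of `sum`;
// OpenMP merges the per-thread copies into the shared `sum`
// when the loop ends.
double sum_array(const double *a, int n)
{
    double sum = 0.0;
    int i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}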

See how the following code relies on a thread-local variable to track the index of the minimum x, then uses a single critical section per thread to compute the global minimum index.

int l = 0,i;
#pragma omp parallel shared (n,l) private (i)
{
    int l_local = 0; //This variable is private to the thread

    #pragma omp for nowait
    for (i = 1; i < n; i++)
    {
        // This part of the code can be executed in parallel
        // since all write operations are on thread-local variables
        if (points[i].x < points[l_local].x)
            l_local = i;
    }

    // The critical section is entered only once by each thread
    #pragma omp critical
    {
    if (points[l_local].x < points[l].x)
        l = l_local;
    }

    #pragma omp barrier
    // a barrier is needed in case some more code follows;
    // otherwise there is an implicit barrier at the end of the parallel region
}

The same principle should be applied to the second parallel loop, which suffers from the same issue of being effectively serialized by the critical statement.
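For instance, here is a sketch of what that refactoring could look like, reusing the orientation() test from the question to merge the per-thread candidates (untested, same manual-reduction pattern as above):

q = (p + 1) % n;

#pragma omp parallel shared (n, p, q)
{
    // Thread-private candidate; p is only read inside the region
    int q_local = (p + 1) % n;
    int k;

    #pragma omp for nowait
    for (k = 0; k < n; k++)
    {
        // All writes go to the thread-local candidate,
        // so the loop body needs no synchronization
        if (orientation(points[p], points[k], points[q_local]) == 2)
            q_local = k;
    }

    // Merge step: one thread at a time checks whether its candidate
    // is more counterclockwise than the current global q
    #pragma omp critical
    {
        if (orientation(points[p], points[q_local], points[q]) == 2)
            q = q_local;
    }
}

Keep in mind that with only a few thousand points and such a cheap loop body (a single orientation test), thread management overhead may still limit the speedup.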

