C++ OpenMP parallel quickSort


Problem Description

Once again I'm stuck when using OpenMP in C++. This time I'm trying to implement a parallel quicksort.

code:

#include <iostream>
#include <vector>
#include <stack>
#include <utility>
#include <omp.h>
#include <stdio.h>

#define SWITCH_LIMIT 1000

using namespace std;

template <typename T>
void insertionSort(std::vector<T> &v, int q, int r)
{
    int key, i;
    for(int j = q + 1; j <= r; ++j)
    {
        key = v[j];
        i = j - 1;
        while( i >= q && v[i] > key )
        {
            v[i+1] = v[i];
            --i;
        }
        v[i+1] = key;
    }
}

stack<pair<int,int> > s;

template <typename T>
void qs(vector<T> &v, int q, int r)
{
    T pivot;
    int i = q - 1, j = r;
    //switch to insertion sort for small data
    if(r - q < SWITCH_LIMIT) 
    {
        insertionSort(v, q, r);
        return;
    }

    pivot = v[r];
    while(true)
    {
        while(v[++i] < pivot);
        while(v[--j] > pivot);
        if(i >= j) break;
        std::swap(v[i], v[j]); 
    }
    std::swap(v[i], v[r]);

    #pragma omp critical
    {
        s.push(make_pair(q, i - 1));
        s.push(make_pair(i + 1, r));        
    }
}

int main()
{
    int n, x;
    int numThreads = 4, numBusyThreads = 0;
    bool *idle = new bool[numThreads];
    for(int i = 0; i < numThreads; ++i)
        idle[i] = true;
    pair<int, int> p;
    vector<int> v;
    cin >> n;
    for(int i = 0; i < n; ++i)
    {
        cin >> x;
        v.push_back(x);
    }
    cout << v.size() << endl;
    s.push(make_pair(0, v.size()));

    #pragma omp parallel shared(s, v, idle, numThreads, numBusyThreads, p) 
    {
        bool done = false;
        while(!done) 
        {
            int id = omp_get_thread_num();
            #pragma omp critical
            {
                if(s.empty() == false && numBusyThreads < numThreads) 
                {
                    ++numBusyThreads;
                    //the current thread is not idle anymore
                    //it will get the interval [q, r] from stack
                    //and run qs on it
                    idle[id] = false;
                    p = s.top();                    
                    s.pop();
                }
                if(numBusyThreads == 0)
                {
                    done = true;
                }
            }
            if(idle[id] == false)
            {

                qs(v, p.first, p.second);
                idle[id] = true;
                #pragma omp critical 
                --numBusyThreads;
            }

        }
    }
    return 0;
}

The algorithm:

To use OpenMP for a recursive function I used a stack to keep track of the next intervals on which the qs function should run. I manually add the 1st interval [0, size] and then let the threads get to work when a new interval is added to the stack.

The problem:

The program ends too early, not sorting the array after creating the 1st set of intervals ([q, i - 1], [i + 1, r]) if you look at the code. My guess is that the threads which get the work consider the local variables of the quicksort function (qs in the code) shared by default, so they mess them up and add no intervals to the stack.

How I compile:

g++ -o qs qs.cc -Wall -fopenmp

How I run it:

./qs < in_100000 > out_100000

where in_100000 is a file containing 100000 on the 1st line, followed by 100k integers on the next line, separated by spaces.

I am using gcc 4.5.2 on Linux.

Thanks for your help.

Answer

I didn't actually run your code, but I see an immediate mistake on p, which should be private, not shared. The parallel invocation of qs: qs(v, p.first, p.second); will have races on p, resulting in unpredictable behavior. The local variables in qs should be okay because all threads have their own stack. However, the overall approach is good. You're on the right track.
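
For illustration, here is a minimal sketch of that fix applied to the parallel region above: p is declared inside the region, so every thread gets its own copy (only the data sharing changes; the rest of the loop is unchanged):

    #pragma omp parallel shared(s, v, idle, numThreads, numBusyThreads)
    {
        //p is now declared inside the region, so it is private to each thread
        pair<int, int> p;
        bool done = false;
        while(!done)
        {
            int id = omp_get_thread_num();
            #pragma omp critical
            {
                if(s.empty() == false && numBusyThreads < numThreads)
                {
                    ++numBusyThreads;
                    idle[id] = false;
                    p = s.top();    //each thread reads the interval into its own p
                    s.pop();
                }
                if(numBusyThreads == 0)
                    done = true;
            }
            if(idle[id] == false)
            {
                qs(v, p.first, p.second);
                idle[id] = true;
                #pragma omp critical
                --numBusyThreads;
            }
        }
    }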

Here are my general comments on implementing parallel quicksort. Quicksort itself is embarrassingly parallel, which means no synchronization is needed: the recursive calls of qs on a partitioned array are embarrassingly parallel.

However, the parallelism is exposed in a recursive form. If you simply use nested parallelism in OpenMP, you will end up with thousands of threads in a second, and no speedup will be gained. So, mostly you need to turn the recursive algorithm into an iterative one. Then you need to implement a sort of work queue. This is your approach, and it's not easy.

For your approach, there is a good benchmark: OmpSCR. You can download it at http://sourceforge.net/projects/ompscr/

In the benchmark, there are several versions of OpenMP-based quicksort. Most of them are similar to yours. However, to increase parallelism, one must minimize contention on the global queue (in your code, it's s). So, there could be a couple of optimizations such as having local queues. Although the algorithm itself is purely parallel, the implementation may require synchronization artifacts. And, most of all, it's very hard to gain speedups.

However, you can still directly use recursive parallelism in OpenMP in two ways: (1) throttling the total number of threads, and (2) using OpenMP 3.0's task.

Here is pseudocode for the first approach (this is only based on OmpSCR's benchmark):

void qsort_omp_recursive(int* begin, int* end)
{
  if (begin != end) {
    // Partition ...

    // Throttling
    if (...)  {
      qsort_omp_recursive(begin, middle);
      qsort_omp_recursive(++middle, ++end);
    } else {

#pragma omp parallel sections
      {
#pragma omp section
        qsort_omp_recursive(begin, middle);
#pragma omp section
        qsort_omp_recursive(++middle, ++end);
      }
    }
  }
}

In order to run this code, you need to call omp_set_nested(1) and omp_set_num_threads(2). The code is really simple: we simply spawn two threads on the division of the work. However, we insert a simple throttling logic to prevent excessive threads. Note that my experimentation showed decent speedups for this approach.
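
To make the idea concrete, here is a small, compilable sketch along those lines. The partition step mirrors the Cilk Plus example further down; SERIAL_LIMIT and the tiny driver are illustrative choices, not part of the original benchmark:

#include <algorithm>
#include <functional>
#include <cstdio>
#include <omp.h>

const int SERIAL_LIMIT = 1000;  //below this size, recurse serially (example value)

void qsort_omp_recursive(int* begin, int* end)
{
  if (end - begin < 2) return;

  //partition around the last element as the pivot
  int* last = end - 1;
  int* middle = std::partition(begin, last,
    std::bind2nd(std::less<int>(), *last));
  std::swap(*last, *middle);

  //throttling: only go parallel for partitions that are large enough
  if (end - begin < SERIAL_LIMIT) {
    qsort_omp_recursive(begin, middle);
    qsort_omp_recursive(middle + 1, end);
  } else {
#pragma omp parallel sections
    {
#pragma omp section
      qsort_omp_recursive(begin, middle);
#pragma omp section
      qsort_omp_recursive(middle + 1, end);
    }
  }
}

int main()
{
  omp_set_nested(1);       //allow nested parallel regions
  omp_set_num_threads(2);  //two threads per parallel region

  int a[] = {5, 3, 8, 1, 9, 2, 7, 4, 6, 0};
  qsort_omp_recursive(a, a + 10);
  for (int i = 0; i < 10; ++i)
    printf("%d ", a[i]);
  printf("\n");
  return 0;
}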

Finally, you may use OpenMP 3.0's task, where a task is a logically concurrent piece of work. In all of the OpenMP approaches above, each parallel construct spawns two physical threads. You might say there is a hard 1-to-1 mapping between a task and a worker thread. However, task separates logical tasks from workers.

Because OpenMP 3.0 is not popular yet, I will use Cilk Plus, which is great for expressing this kind of nested and recursive parallelism. In Cilk Plus, the parallelization is extremely easy:

#include <algorithm>    //std::partition, std::swap
#include <functional>   //std::bind2nd, std::less
#include <cilk/cilk.h>  //cilk_spawn, cilk_sync

void qsort(int* begin, int* end)
{
  if (begin != end) {
    --end;
    int* middle = std::partition(begin, end,
      std::bind2nd(std::less<int>(), *end));
    std::swap(*end, *middle);

    cilk_spawn qsort(begin, middle);
    qsort(++middle, ++end);
    // cilk_sync; Only necessary at the final stage.
  }
}

I copied this code from Cilk Plus' example code. You will see that a single keyword, cilk_spawn, is everything needed to parallelize the quicksort. I'm skipping the explanations of Cilk Plus and the spawn keyword. However, it's easy to understand: the two recursive calls are declared as logically concurrent tasks. Whenever the recursion takes place, logical tasks are created, but the Cilk Plus runtime (which implements an efficient work-stealing scheduler) handles all kinds of dirty work. It optimally queues the parallel tasks and maps them to the worker threads.
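
As a usage sketch (my illustration, not from the original answer): a caller that spawns the top-level sort needs an explicit cilk_sync before reading the results, which is what the commented-out cilk_sync above is about:

#include <cilk/cilk.h>
#include <cstdio>

void qsort(int* begin, int* end);  //the Cilk Plus version shown above

int main()
{
  int a[] = {5, 3, 8, 1, 9, 2, 7, 4, 6, 0};

  cilk_spawn qsort(a, a + 10);  //the sort may run in parallel with other work
  //... other independent work could go here ...
  cilk_sync;                    //wait for the sort before reading the results

  for (int i = 0; i < 10; ++i)
    printf("%d ", a[i]);
  printf("\n");
  return 0;
}

Building this requires a Cilk Plus-capable compiler, for example the Intel compiler or a GCC version with -fcilkplus support.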

Note that OpenMP 3.0's task is essentially similar to Cilk Plus's approach. My experimentation showed that pretty nice speedups were feasible: I got a 3~4x speedup on an 8-core machine, and the speedup scaled. Cilk Plus' absolute speedups were greater than those of OpenMP 3.0.

The approach of Cilk Plus (and OpenMP 3.0) and your approach are essentially the same: the separation of parallel tasks from workload assignment. However, it's very difficult to implement efficiently. For example, you have to reduce contention and use lock-free data structures.

