并行基数排序,这个实现如何实际工作?是否有一些启发式方法? [英] Parallel radix sort, how would this implementation actually work? Are there some heuristics?

查看:119
本文介绍了并行基数排序,这个实现如何实际工作?是否有一些启发式方法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在开发一个Udacity测验用于他们的并行编程课程。我很抱歉我应该如何开始作业,因为我不知道如果我理解它正确。



对于赋值(在代码中),我们给出两个数组和数组以及一个位置数组。我们应该使用并行基数排序对数组数组进行排序,同时正确设置位置。



我完全理解radix排序以及它是如何工作的。我不明白是他们想要我们如何实现它。下面是开始赋值的模板

  // Udacity HW 4 
// Radix Sorting

#includereference_calc.cpp
#includeutils.h

/ *红眼去除
=========== ====

对于此分配,我们正在实施红眼消除。这是
,通过首先为每个像素创建一个分数,告诉我们
可能是一个红眼像素。我们已经为你做了 - 你
正在接收分数,需要按升序排序,以便我们
知道改变哪些像素以消除红眼。

注意:升序==最小到最大

每个分数与一个位置相关联,当您对分数进行排序时,必须
也相应地移动位置。

实现并行基数用CUDA
============================== ==========

基本思想是在每次通过时构造一个直方图,每个
digit有多少。然后我们扫描这个直方图,以便我们知道在哪里放置
每个数字的输出。例如,前1必须在所有
0之后,所以我们必须知道有多少0能够开始移动1s
到正确的位置。

1)每个数字的出现次数的直方图
2)独占前缀直方图的总和
3)确定每个数字的相对偏移
例如[0 0 1 1 0 0 1]
- > [0 1 0 1 2 3 2]
4)结合步骤2和4的结果。 3确定每个元素的最终
输出位置并将其移动

LSB基数排序是一种不合适的排序,您将需要乒乓值
在我们提供的输入和输出缓冲区之间。确保最后的
排序结果在输出缓冲区!提示:您可能需要在末尾做一个副本


* /


void your_sort(unsigned int * const d_inputVals,
unsigned int * const d_inputPos,
unsigned int * const d_outputVals ,
unsigned int * const d_outputPos,
const size_t numElems)
{

}

我特别不明白这4个步骤如何排序数组。



所以对于第一步,我应该创建一个直方图的数字(为什么是在引号..?)。所以给定一个输入值n我需要计算0和1的直方图。所以,应该第1步创建一个直方图数组,每个输入值一个?



很好,对于其余的步骤,它分解得很快。有人可以告诉我这些步骤应该如何实现基数排序?

解决方案

基数排序的基本思想是,我们将考虑每个元素逐位排序,从最低有效到最重要的。对于每个数字,我们将移动元素,以便这些数字按升序排列。



让我们举个简单的例子。让我们对四个量进行排序,每个量都有4个二进制数字。让我们选择1,4,7和14.我们将它们混合,并可视化二进制表示:

 元素# 1 2 3 4 
价值:7 14 4 1
二进制:0111 1110 0100 0001

首先,我们将考虑位0:

 元素#1 2 3 4 
值:7 14 4 1
二进制:0111 1110 0100 0001
位0:1 0 0 1

现在,基数排序算法说,我们必须以这样一种方式移动元素:(仅考虑位0)所有的零在左边,所有的都在右边。让我们这样做,同时保持元素的顺序为零位保留元素的顺序为一位。我们可以这样做:

 元素#2 3 1 4 
值:14 4 7 1
二进制:1110 0100 0111 0001
位0:0 0 1 1

的我们的基数排序是完整的。下一步是考虑下一个(二进制)数字:

 元素#3 2 1 4 
值:4 14 7 1
二进制:0100 1110 0111 0001
位1:0 1 1 0

再一次,我们必须移动元素,以使所讨论的数字(位1)按升序排列:

 元素#3 4 2 1 
值:4 1 14 7
二进制:0100 0001 1110 0111
位1:0 0 1 1
pre>

现在我们必须移动到下一个高位:

 元素#3 4 2 1 
值:4 1 14 7
二进制:0100 0001 1110 0111
位2:1 0 1 1
pre>

并移动它们:

  Element#4 3 2 1 
值:1 4 14 7
二进制:0001 0100 1110 0111
位2:0 1 1 1

现在我们移动到最后一个(最高位)数字:

  3 2 1 
值:1 4 14 7
二进制:0001 0100 1110 0111
bit 3:0 0 1 0

并做最后一步:

 元素4 3 1 2 
值:1 4 7 14
二进制:0001 0100 0111 1110
bit 3:0 0 0 1

现在对值进行排序。这希望看起来很清楚,但在目前的描述中,我们已经掩饰了诸如我们如何知道哪些元素移动?和我们怎么知道把它们放在哪里?因此,让我们重复我们的例子,但我们将使用提示中建议的特定方法和顺序,以回答这些问题。从位0开始:

 元素#1 2 3 4 
值:7 14 4 1
二进制:0111 1110 0100 0001
bit 0:1 0 0 1

位0位置中的零位数量的直方图和位位置中的零位数量:

  bit 0:1 0 0 1 

零位一位
--------- --------
1)直方图:2 2



现在让我们对这些直方图值进行独占前缀和:

 零位一位
--------- --------
1 )histogram:2 2
2)prefix sum:0 2

只是所有先前值的和。在第一位置中没有先前值,并且在第二位置,前一值是2(在位0位置具有0位的元素的数量)。现在,作为一个独立的操作,让我们确定每个0位在所有零位之间的相对偏移量,以及所有一位中的每一位:

  bit 0:1 0 0 1 
3)offset:0 0 1 1


$ b b

这实际上可以通过使用独占前缀总和再次进行编程,分别考虑0组和1组,并将每个位置视为值为1:

  0位0:1 1 
3) psum:0 1

1位0:1 1
3) psum:0 1

现在,给定算法的第4步说:


4)结合步骤2和4的结果。 3来确定每个元素的最终输出位置并将其移动到


这意味着,对于每个元素,我们将选择直方图框前缀总和值对应于其位值(0或1),并且添加与其位置相关联的偏移量,以确定将该元素移动到的位置:

 元素#1 2 3 4 
值:7 14 4 1
二进制:0111 1110 0100 0001
位0:1 0 0 1
hist psum:2 0 0 2
offset:0 0 1 1
new index:2 0 1 3

将每个元素移动到其新索引位置,我们有:

  2 3 1 4 
价值:14 4 7 1
二进制:0111 1110 0111 0001

这正是我们根据前一个步骤完成第一个数字移动的结果。这已经完成步骤1,即第一(最低有效)数字;



注意:



我们仍然有剩余的数字要处理,在每个步骤创建一个新的直方图和新的前缀总和。


  1. 基数排序,即使在计算机中,也不必严格按照二进制数字进行。可以构造具有不同大小的数字的类似算法,可能由2,3或4位组成。

  2. 我们可以对基数排序执行的优化之一是仅基于实际有意义的数字数量进行排序。例如,如果我们将数量存储在32位值中,但是我们知道存在的最大数量是1023(2 ^ 10-1),我们不需要在所有32位上进行排序。

  3. 这与GPU有什么关系?我们可以停止,期待一个正确的排序。就上述描述而言,不多。实际应用是考虑使用并行算法来处理直方图,前缀和和数据移动。基数分解的这种分解允许定位和使用已经为这些更基本的操作开发的并行算法,以便构建快速并行排序。

下面是一个工作示例。这可能有助于您理解基数排序。我不认为这将有助于你的赋值,因为这个例子执行一个32位的基数排序在warp层,对于一个warp,ie。共32个数量。但是从理解的角度来看的一个可能的优点是,像直方图和前缀总和这样的东西可以在仅仅几个指令的warp层次,利用各种CUDA内在性。对于您的作业,您将无法使用这些技术,您需要提供可以对任意数据集大小进行操作的全功能并行前缀总和,直方图等。

  #include< stdio.h> 
#include< stdlib.h>
#define WSIZE 32
#define LOOPS 100000
#define UPPER_BIT 31
#define LOWER_BIT 0

__device__ unsigned int ddata [WSIZE];

// naive warp-level按位排序

__global__ void mykernel(){
__shared__ volatile unsigned int sdata [WSIZE * 2];
//从全局加载到共享变量
sdata [threadIdx.x] = ddata [threadIdx.x];
unsigned int bitmask = 1<< LOWER_BIT;
unsigned int offset = 0;
unsigned int thrmask = 0xFFFFFFFFU<< threadIdx.x;
unsigned int mypos;
// for each LSB to MSB
for(int i = LOWER_BIT; i <= UPPER_BIT; i ++){
unsigned int mydata = sdata [((WSIZE-1)-threadIdx。 x)+ offset];
unsigned int mybit = mydata& bitmask;
//获取一和零的总数(cc 2.0投票)
unsigned int ones = __ballot(mybit); // cc 2.0
unsigned int zeroes =〜ones;
offset ^ = WSIZE; //切换乒乓缓冲区
//执行零,然后执行
if(!mybit)//线程为零位
//获取我在乒乓缓冲区中的位置
mypos = __popc(zeroes& thrmask);
else //带有一个位的线程
//在乒乓缓冲区中获取我的位置
mypos = __popc(zeroes)+ __ popc(ones& thrmask);
//移动到缓冲区(或使用shfl的cc 3.0)
sdata [mypos-1 + offset] = mydata;
//对于下一位重复
位掩码<= 1;
}
//将结果保存到全局
ddata [threadIdx.x] = sdata [threadIdx.x + offset];
}

int main(){

unsigned int hdata [WSIZE];
for(int lcount = 0; lcount unsigned int range = 1U<<< UPPER_BIT;
for(int i = 0; i cudaMemcpyToSymbol(ddata,hdata,WSIZE * sizeof(unsigned int));
mykernel<<< 1,WSIZE>>>();
cudaMemcpyFromSymbol(hdata,ddata,WSIZE * sizeof(unsigned int));
for(int i = 0; i hdata [i + 1]){printf(循环中的排序错误%d,hdata [%d ] =%d,hdata [%d] =%d \\\
,lcount,i,hdata [i],i + 1,hdata [i + 1] return 1;}
// printf(sorted data:\\\
);
// for(int i = 0; i ,hdata [i]);
}
printf(Success!\\\
);
return 0;
}


I am working on an Udacity quiz for their parallel programming course. I am pretty stuck on how I should start on the assignment because I am not sure if I understand it correctly.

For the assignment (in code) we are given two arrays and array on values and an array of positions. We are supposed to sort the array of values with a parallelized radix sort, along with setting the positions correctly too.

I completely understand radix sort and how it works. What I don't understand is how they want us to implemented it. Here is the template given to start the assignment

//Udacity HW 4
//Radix Sorting

#include "reference_calc.cpp"
#include "utils.h"

/* Red Eye Removal
   ===============

   For this assignment we are implementing red eye removal.  This is
   accomplished by first creating a score for every pixel that tells us how
   likely it is to be a red eye pixel.  We have already done this for you - you
   are receiving the scores and need to sort them in ascending order so that we
   know which pixels to alter to remove the red eye.

   Note: ascending order == smallest to largest

   Each score is associated with a position, when you sort the scores, you must
   also move the positions accordingly.

   Implementing Parallel Radix Sort with CUDA
   ==========================================

   The basic idea is to construct a histogram on each pass of how many of each
   "digit" there are.   Then we scan this histogram so that we know where to put
   the output of each digit.  For example, the first 1 must come after all the
   0s so we have to know how many 0s there are to be able to start moving 1s
   into the correct position.

   1) Histogram of the number of occurrences of each digit
   2) Exclusive Prefix Sum of Histogram
   3) Determine relative offset of each digit
        For example [0 0 1 1 0 0 1]
                ->  [0 1 0 1 2 3 2]
   4) Combine the results of steps 2 & 3 to determine the final
      output location for each element and move it there

   LSB Radix sort is an out-of-place sort and you will need to ping-pong values
   between the input and output buffers we have provided.  Make sure the final
   sorted results end up in the output buffer!  Hint: You may need to do a copy
   at the end.

 */


void your_sort(unsigned int* const d_inputVals,
               unsigned int* const d_inputPos,
               unsigned int* const d_outputVals,
               unsigned int* const d_outputPos,
               const size_t numElems)
{

}

I specifically don't understand how those 4 steps end up sorting the array.

So for the first step, I am supposed to create a histogram of the "digits" (why is that in quotes..?). So given a input value n I need to make a count of the 0's and 1's into a histogram. So, should step 1 create an array of histograms, one for each input value?

And well, for the rest of the steps it breaks down pretty quickly. Could someone show me how these steps are supposed to implement a radix sort?

解决方案

The basic idea behind a radix sort is that we will consider each element to be sorted digit by digit, from least significant to most significant. For each digit, we will move the elements so that those digits are in increasing order.

Let's take a really simple example. Let's sort four quantities, each of which have 4 binary digits. Let's choose 1, 4, 7, and 14. We'll mix them up and also visualize the binary representation:

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001

First we will consider bit 0:

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
bit 0:       1       0       0       1

Now the radix sort algorithm says we must move the elements in such a way that (considering only bit 0) all the zeroes are on the left, and all the ones are on the right. Let's do this while preserving the order of the elements with a zero bit and preserving the order of the elements with a one bit. We could do that like this:

Element #    2       3       1       4
Value:       14      4       7       1
Binary:      1110    0100    0111    0001
bit 0:       0       0       1       1

The first step of our radix sort is complete. The next step is to consider the next (binary) digit:

Element #    3       2       1       4
Value:       4       14      7       1
Binary:      0100    1110    0111    0001
bit 1:       0       1       1       0

Once again, we must move elements so that the digit in question (bit 1) is arranged in ascending order:

Element #    3       4       2       1
Value:       4       1       14      7
Binary:      0100    0001    1110    0111
bit 1:       0       0       1       1

Now we must move to the next higher digit:

Element #    3       4       2       1
Value:       4       1       14      7
Binary:      0100    0001    1110    0111
bit 2:       1       0       1       1

And move them again:

Element #    4       3       2       1
Value:       1       4       14      7
Binary:      0001    0100    1110    0111
bit 2:       0       1       1       1

Now we move to the last (highest order) digit:

Element #    4       3       2       1
Value:       1       4       14      7
Binary:      0001    0100    1110    0111
bit 3:       0       0       1       0

And make our final move:

Element #    4       3       1       2
Value:       1       4       7       14
Binary:      0001    0100    0111    1110
bit 3:       0       0       0       1

And the values are now sorted. This hopefully seems clear, but in the description so far we've glossed over the details of things like "how do we know which elements to move?" and "how do we know where to put them?" So let's repeat our example, but we'll use the specific methods and sequence suggested in the prompt, in order to answer these questions. Starting over with bit 0:

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
bit 0:       1       0       0       1

First let's build a histogram of the number of zero bits in bit 0 position, and the number of 1 bits in bit 0 position:

bit 0:       1       0       0       1

              zero bits       one bits
              ---------       --------
1)histogram:         2              2

Now let's do an exclusive prefix-sum on these histogram values:

              zero bits       one bits
              ---------       --------
1)histogram:         2              2
2)prefix sum:        0              2

An exclusive prefix-sum is just the sum of all preceding values. There are no preceding values in the first position, and in the second position the preceding value is 2 (the number of elements with a 0 bit in bit 0 position). Now, as an independent operation, let's determine the relative offset of each 0 bit amongst all the zero bits, and each one bit amongst all the one bits:

bit 0:       1       0       0       1
3)offset:    0       0       1       1

This can actually be done programmatically using exclusive prefix-sums again, considering the 0-group and 1-group separately, and treating each position as if it has a value of 1:

0 bit 0:             1       1       
3)ex. psum:          0       1    

1 bit 0:     1                        1      
3)ex. psum:  0                        1   

Now, step 4 of the given algorithm says:

4) Combine the results of steps 2 & 3 to determine the final output location for each element and move it there

What this means is, for each element, we will select the histogram-bin prefix sum value corresponding to its bit value (0 or 1) and add to that, the offset associated with its position, to determine the location to move that element to:

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
bit 0:       1       0       0       1
hist psum:   2       0       0       2
offset:      0       0       1       1
new index:   2       0       1       3

Moving each element to its "new index" position, we have:

Element #    2       3       1       4
Value:       14      4       7       1
Binary:      0111    1110    0111    0001

Which is exactly the result we expect for the completion of our first digit-move, based on the previous walk-through. This has completed step 1, i.e. the first (least-significant) digit; we still have the remaining digits to process, creating a new histogram and new prefix sums at each step.

Notes:

  1. Radix-sort, even in a computer, does not have to be done based strictly on binary digits. It's possible to construct a similar algorithm with digits of different sizes, perhaps consisting of 2,3, or 4 bits.
  2. One of the optimizations we can perform on a radix sort is to only sort based on the number of digits that are actually meaningful. For example, if we are storing quantities in 32-bit values, but we know that the largest quantity present is 1023 (2^10-1), we need not sort on all 32 bits. We can stop, expecting a proper sort, after proceeding through the first 10 bits.
  3. What does any of this have to do with GPUs? In so far as the above description goes, not much. The practical application is to consider using parallel algorithms for things like the histogram, the prefix-sums, and the data movement. This decomposition of radix-sort allows one to locate and use parallel algorithms already developed for these more basic operations, in order to construct a fast parallel sort.

What follows is a worked example. This may help with your understanding of radix sort. I don't think it will help with your assignment, because this example performs a 32-bit radix sort at the warp level, for a single warp, ie. for 32 quantities. But a possible advantage from an understanding point of view is that things like histogramming and prefix sums can be done at the warp level in just a few instructions, taking advantage of various CUDA intrinsics. For your assignment, you won't be able to use these techniques, and you will need to come up with full-featured parallel prefix sums, histograms, etc. that can operate on an arbitrary dataset size.

#include <stdio.h>
#include <stdlib.h>
#define WSIZE 32
#define LOOPS 100000
#define UPPER_BIT 31
#define LOWER_BIT 0

__device__ unsigned int ddata[WSIZE];

// naive warp-level bitwise radix sort

__global__ void mykernel(){
  __shared__ volatile unsigned int sdata[WSIZE*2];
  // load from global into shared variable
  sdata[threadIdx.x] = ddata[threadIdx.x];
  unsigned int bitmask = 1<<LOWER_BIT;
  unsigned int offset  = 0;
  unsigned int thrmask = 0xFFFFFFFFU << threadIdx.x;
  unsigned int mypos;
  //  for each LSB to MSB
  for (int i = LOWER_BIT; i <= UPPER_BIT; i++){
    unsigned int mydata = sdata[((WSIZE-1)-threadIdx.x)+offset];
    unsigned int mybit  = mydata&bitmask;
    // get population of ones and zeroes (cc 2.0 ballot)
    unsigned int ones = __ballot(mybit); // cc 2.0
    unsigned int zeroes = ~ones;
    offset ^= WSIZE; // switch ping-pong buffers
    // do zeroes, then ones
    if (!mybit) // threads with a zero bit
      // get my position in ping-pong buffer
      mypos = __popc(zeroes&thrmask);
    else        // threads with a one bit
      // get my position in ping-pong buffer
      mypos = __popc(zeroes)+__popc(ones&thrmask);
    // move to buffer  (or use shfl for cc 3.0)
    sdata[mypos-1+offset] = mydata;
    // repeat for next bit
    bitmask <<= 1;
    }
  // save results to global
  ddata[threadIdx.x] = sdata[threadIdx.x+offset];
  }

int main(){

  unsigned int hdata[WSIZE];
  for (int lcount = 0; lcount < LOOPS; lcount++){
    unsigned int range = 1U<<UPPER_BIT;
    for (int i = 0; i < WSIZE; i++) hdata[i] = rand()%range;
    cudaMemcpyToSymbol(ddata, hdata, WSIZE*sizeof(unsigned int));
    mykernel<<<1, WSIZE>>>();
    cudaMemcpyFromSymbol(hdata, ddata, WSIZE*sizeof(unsigned int));
    for (int i = 0; i < WSIZE-1; i++) if (hdata[i] > hdata[i+1]) {printf("sort error at loop %d, hdata[%d] = %d, hdata[%d] = %d\n", lcount,i, hdata[i],i+1, hdata[i+1]); return 1;}
    // printf("sorted data:\n");
    //for (int i = 0; i < WSIZE; i++) printf("%u\n", hdata[i]);
    }
  printf("Success!\n");
  return 0;
}

这篇关于并行基数排序,这个实现如何实际工作?是否有一些启发式方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆