CUDA中的Bitonic排序将某些值置乱 [英] Bitonic sorting in cuda misorders some values

查看:62
本文介绍了CUDA中的Bitonic排序将某些值置乱的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在针对较大的项目在CUDA上进行排序,因此我决定实施Bitonic排序.我将要排序的元素数始终是2的幂,实际上是512.我需要一个具有最终位置的数组,因为此方法将用于对表示另一个质量矩阵的数组进行排序解决方案.

i'm making a sorting algorithm on CUDA for a bigger project and i decided implementing a Bitonic sorting. The number of elements i'll be sorting will be allways a power of two, in fact will be 512. I need an array which will have the final positions because this method will be used for ordering an array that represents the quality matrix of another solution.

fitness是我要排序的数组,numElements是元素的数量,而orden最初是一个空的数组,具有numElements位置,将以这种方式在一开始就填充:orden[i]=i.其实orden与这个问题无关,但我保留了它.

fitness is the array i'll sort, numElements is the number of elements, and orden is initially an empty array with numElements positions which will be filled at the very beginning in this way: orden[i]=i. Actually orden is not relevant for this issue but I kept it.

我的问题是某些值没有正确排序,直到现在我一直无法弄清楚我遇到了什么问题.

My problem is that some values aren't sorted properly and until now i've been unable to figure out what problem do I have.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <ctime>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <device_functions.h>
#include "float.h"


__global__ void sorting(int * orden, float * fitness, int numElements);

// Populating array with random values for testing purposes
__global__ void populate( curandState * state, float * fitness{

    curandState localState = state[threadIdx.x];
    int a = curand(&localState) % 500;
    fitness[threadIdx.x] = a;
}

//Curand setup for the populate method 
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
    int id = threadIdx.x;
    curand_init(seed, id, 0, &state[id]);
}

int main()
{
    float * arrayx;
    int numelements = 512;
    int * orden;
    float arrayCPU[512] = { 0 };
    curandState * state;

    cudaDeviceReset();
    cudaSetDevice(0);
    cudaMalloc(&state, numelements * sizeof(curandState));
    cudaMalloc((void **)&arrayx, numelements*sizeof(float));
    cudaMalloc((void **)&orden, numelements*sizeof(int));






    setup_cuRand << <1, numelements >> >(state, unsigned(time(NULL)));

    populate << <1, numelements >> > (state, arrayx);
    cudaMemcpy(&arrayCPU, arrayx, numelements * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < numelements; i++)
        printf("fitness[%i] = %f\n", i, arrayCPU[i]);

    sorting << <1, numelements >> >(orden, arrayx, numelements);
    printf("\n\n");

    cudaMemcpy(&arrayCPU, arrayx, numelements * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < numelements; i++)
        printf("fitness[%i] = %f\n", i, arrayCPU[i]);



    cudaDeviceReset();


    return 0;
}
__device__ bool isValid(float n){
    return !(isnan(n) || isinf(n) || n != n || n <= FLT_MIN || n >= FLT_MAX);

}

__global__ void sorting(int * orden, float * fitness, int numElements){
    int i = 0;
    int j = 0;
    float f = 0.0;
    int aux = 0;

    //initial orden registered (1, 2, 3...)
    orden[threadIdx.x] = threadIdx.x;
    //Logarithm on base 2 of numElements
    for (i = 2; i <= numElements; i = i * 2){
        // descending from i reducing to half each iteration
        for (j = i; j >= 2; j = j / 2){

            if (threadIdx.x % j  < j / 2){
                __syncthreads();
                // ascending or descending consideration using (threadIdx.x % (i*2) < i) 
                if ((threadIdx.x % (i * 2) < i) && (fitness[threadIdx.x] >  fitness[threadIdx.x + j / 2] || !isValid(fitness[threadIdx.x])) ||
                    ((threadIdx.x % (i * 2) >= i) && (fitness[threadIdx.x] <= fitness[threadIdx.x + j / 2] || !isValid(fitness[threadIdx.x + j / 2])))){

                    aux = orden[threadIdx.x];
                    orden[threadIdx.x] = orden[threadIdx.x + j / 2];
                    orden[threadIdx.x + j / 2] = aux;
                    //Se reubican los fitness
                    f = fitness[threadIdx.x];
                    fitness[threadIdx.x] = fitness[threadIdx.x + j / 2];
                    fitness[threadIdx.x + j / 2] = f;
                }
            }
        }
    }
}

例如,我随机执行一个输出:

For example, an output i got on a random execution:

随机执行

这是我的双音排序的代表:

This is a representation of my bitonic sorting:

双音排序模式,箭头指向比较值最差的地方

Bitonic sorting Schema, the arrows point where the worst of the values compared goes to

推荐答案

以下是我发现的问题:

  1. 在您发布的代码中,该代码无法编译:

  1. In your posted code, this does not compile:

__global__ void populate( curandState * state, float * fitness{
                                                              ^
                                                   missing close parenthesis

我在那里添加了一个短括号.

I added a close parenthesis there.

在这些cudaMemcpy语句中不必获取数组的地址:

It's not necessary to take the address of the array in these cudaMemcpy statements:

cudaMemcpy(&arrayCPU, arrayx, numelements * sizeof(float), cudaMemcpyDeviceToHost);
....
cudaMemcpy(&arrayCPU, arrayx, numelements * sizeof(float), cudaMemcpyDeviceToHost);

数组名已经是数组的地址,所以我删除了&符.如果您使用动态分配的数组,则这种用法会被破坏.

the array name is already the address of the array, so I removed the ampersands. If you use a dynamically allocated array, such usage would be broken.

您在这里使用的__syncthreads()已损坏:

Your usage of __syncthreads() here is broken:

for (j = i; j >= 2; j = j / 2){

    if (threadIdx.x % j  < j / 2){
        __syncthreads();

条件语句中的__syncthreads()

用法通常是不正确的,除非条件语句在整个线程块中进行统一评估. 文档对此进行了介绍.稍作更改即可达到预期的效果:

usage of __syncthreads() inside a conditional statement is generally incorrect unless the conditional statement evaluates uniformly across the threadblock. This is covered in the documentation. We can achieve the desired effect with a slight change:

for (j = i; j >= 2; j = j / 2){
    __syncthreads();
    if (threadIdx.x % j  < j / 2){

在进行了上述更改之后,在大多数情况下,您的代码对于我来说似乎可以正常运行.如果您打算正确地排序0(或任何负值),则在有效性检查中使用FLT_MIN也会引起疑问.一般来说,FLT_MIN是一个非常小的数字,接近于零.如果您认为这是一个很大的负数,则不是.因此,零是您的随机数生成器的可能输出,并且不会正确排序.我将把这个问题留给您修复,它应该很简单,但这将取决于您最终想要实现的目标. (如果您只想对正的非零浮点值进行排序,则测试可能可以,但是在这种情况下,您的随机数生成器可以返回0.)

With the above changes, your code appears to run correctly for me, for most cases. Your usage of FLT_MIN in your validity check is also questionable, if you intend 0 (or any negative values) to be sorted correctly. Speaking generally, FLT_MIN is a number that is very small, close to zero. If you were thinking that this is a large negative number, it is not. As a result, zero is a possible output of your random number generator, and it will not be sorted correctly. I'll leave this one to you to fix, it should be straightforward, but it will depend on what you ultimately want to achieve. (If you only want to sort positive non-zero floating point values, the test may be OK, but in this case your random number generator can return 0.)

这篇关于CUDA中的Bitonic排序将某些值置乱的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆