Making CUB BlockRadixSort on-chip entirely?


Question

I am reading the CUB documentation and examples:

#include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[4];
    ...
    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    ...
}

In the example, each thread has 4 keys. It looks like 'thread_keys' will be allocated in local memory (which resides in global memory). If I only have 1 key per thread, could I declare "int thread_key;" and keep this variable in a register only?

BlockRadixSort(temp_storage).Sort() takes a pointer to the keys as its parameter. Does that mean the keys have to be in global memory?

I would like to use this code, but I want each thread to hold one key in a register and keep the keys on-chip, in registers or shared memory, after they are sorted. Thanks in advance!

Recommended answer

You can do this using shared memory (which will keep it "on-chip"). I'm not sure I know how to do it using strictly registers without deconstructing the BlockRadixSort object.

Here's some example code that uses shared memory to hold the initial data to be sorted and the final sorted results. This sample is mostly set up for one data element per thread, since that seems to be what you are asking for. It's not difficult to extend it to multiple elements per thread, and I have put most of the plumbing in place to do that, with the exception of the data synthesis and debug printouts:

#include <cub/cub.cuh>
#include <stdio.h>
#define nTPB 32
#define ELEMS_PER_THREAD 1

// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER_THREAD integers)
__global__ void BlockSortKernel()
{
    __shared__ int my_val[nTPB*ELEMS_PER_THREAD];
    using namespace cub;
    // Specialize BlockRadixSort collective types
    typedef BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
    // Allocate shared memory for collectives
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // need to extend synthetic data for ELEMS_PER_THREAD > 1
    my_val[threadIdx.x*ELEMS_PER_THREAD]  = (threadIdx.x + 5)%nTPB; // synth data
    __syncthreads();
    printf("thread %d data = %d\n", threadIdx.x,  my_val[threadIdx.x*ELEMS_PER_THREAD]);

    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ELEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ELEMS_PER_THREAD))));
    __syncthreads();

    printf("thread %d sorted data = %d\n", threadIdx.x,  my_val[threadIdx.x*ELEMS_PER_THREAD]);
}

int main(){
    BlockSortKernel<<<1,nTPB>>>();
    cudaDeviceSynchronize();

}

This seems to work correctly for me; in this case I happened to be using RHEL 5.5/gcc 4.1.2, CUDA 6.0 RC, and CUB v1.2.0 (which is quite recent).
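As a rough sketch of the extension to several elements per thread mentioned above, the version below fills in the data synthesis for ELEMS_PER_THREAD = 4 (an assumed value). Note that it is not the same approach as the cast in the kernel above: each thread copies its segment of the shared array into a small per-thread array, sorts that, and writes the result back to shared memory. The names BlockSortMultiKernel, RunBlockSortMulti, N_ITEMS and the d_out buffer are hypothetical and exist only so the host can check the result.

```cuda
#include <cub/cub.cuh>
#include <stdio.h>

#define nTPB 32
#define ELEMS_PER_THREAD 4                  // assumed value; the answer uses 1
#define N_ITEMS (nTPB * ELEMS_PER_THREAD)

// Hypothetical kernel: sorts N_ITEMS ints held in shared memory.
__global__ void BlockSortMultiKernel(int *d_out)
{
    __shared__ int my_val[N_ITEMS];
    typedef cub::BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // Synthetic data for every element owned by this thread (blocked layout).
    // 37 is odd, so (idx * 37 + 5) % 128 is a permutation of 0..127.
    int base = threadIdx.x * ELEMS_PER_THREAD;
    for (int i = 0; i < ELEMS_PER_THREAD; ++i)
        my_val[base + i] = ((base + i) * 37 + 5) % N_ITEMS;
    __syncthreads();

    // Stage this thread's segment in a per-thread array (assumption: this
    // replaces the cast used in the original kernel).
    int keys[ELEMS_PER_THREAD];
    for (int i = 0; i < ELEMS_PER_THREAD; ++i)
        keys[i] = my_val[base + i];

    // Collectively sort the keys; the output is again in blocked layout.
    my_block_sort(sort_temp_stg).Sort(keys);

    // Write the sorted keys back to shared memory.
    for (int i = 0; i < ELEMS_PER_THREAD; ++i)
        my_val[base + i] = keys[i];
    __syncthreads();

    // Copy to global memory only so the host can verify the result.
    for (int i = threadIdx.x; i < N_ITEMS; i += nTPB)
        d_out[i] = my_val[i];
}

// Hypothetical host helper: returns true if the block came back sorted.
bool RunBlockSortMulti()
{
    int *d_out = NULL;
    if (cudaMalloc(&d_out, N_ITEMS * sizeof(int)) != cudaSuccess) return false;
    BlockSortMultiKernel<<<1, nTPB>>>(d_out);
    int h_out[N_ITEMS];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < N_ITEMS; ++i)
        if (h_out[i] != i) return false;    // input was a permutation of 0..127
    return true;
}

int main()
{
    bool ok = RunBlockSortMulti();
    printf("multi-element block sort %s\n", ok ? "passed" : "failed");
    return ok ? 0 : 1;
}
```

The copy to d_out is there purely for checking; in a real kernel the sorted values would simply be consumed from my_val after the final __syncthreads().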

The strange/ugly static casting is needed, as far as I can tell, because the CUB Sort expects a reference to an array whose length equals the template parameter ITEMS_PER_THREAD (i.e. ELEMS_PER_THREAD here):

   __device__ __forceinline__ void Sort(
        Key     (&keys)[ITEMS_PER_THREAD],          
        int     begin_bit   = 0,                   
        int     end_bit     = sizeof(Key) * 8)      
   { ...
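
Given that signature, one thing that may be worth trying is a per-thread array of exactly ITEMS_PER_THREAD keys, which binds to the array reference without any cast. The sketch below does this for one key per thread. It rests on an assumption to check against the compiled output: that the compiler keeps a one-element array with compile-time indexing in a register. It also passes begin_bit and end_bit explicitly, since the synthetic keys are all in [0, 32) and only the low 5 bits need sorting. SortOneKeyPerThread, RunSortOneKeyPerThread and d_out are hypothetical names used only for illustration.

```cuda
#include <cub/cub.cuh>
#include <stdio.h>

#define nTPB 32

// Hypothetical kernel: each thread owns exactly one key.
__global__ void SortOneKeyPerThread(int *d_out)
{
    typedef cub::BlockRadixSort<int, nTPB, 1> my_block_sort;
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // A one-element array matches Key (&keys)[ITEMS_PER_THREAD] directly.
    // Assumption: the compiler places it in a register.
    int key[1];
    key[0] = (threadIdx.x + 5) % nTPB;      // same synthetic data as the answer

    // Keys are in [0, 32), so sorting bits [0, 5) is enough.
    my_block_sort(sort_temp_stg).Sort(key, 0, 5);

    // After the sort, thread i holds the key of rank i.
    // Written to global memory only so the host can verify it.
    d_out[threadIdx.x] = key[0];
}

// Hypothetical host helper: returns true if the keys came back sorted.
bool RunSortOneKeyPerThread()
{
    int *d_out = NULL;
    if (cudaMalloc(&d_out, nTPB * sizeof(int)) != cudaSuccess) return false;
    SortOneKeyPerThread<<<1, nTPB>>>(d_out);
    int h_out[nTPB];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < nTPB; ++i)
        if (h_out[i] != i) return false;    // input was a permutation of 0..31
    return true;
}

int main()
{
    bool ok = RunSortOneKeyPerThread();
    printf("one-key-per-thread sort %s\n", ok ? "passed" : "failed");
    return ok ? 0 : 1;
}
```

Even if key[0] does stay in a register, the sort itself still exchanges data through the shared-memory TempStorage, so this only keeps the input and output keys out of shared memory.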
