Making CUB BlockRadixSort entirely on-chip?

Question

I am reading the CUB documentation and examples:

#include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[4];
    ...
    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    ...
}

In the example, each thread has 4 keys. It looks like 'thread_keys' will be allocated in local memory, which resides off-chip in global memory. If I only have one key per thread, could I declare "int thread_key;" and keep this variable in a register only?

BlockRadixSort(temp_storage).Sort() takes a pointer to the keys as its parameter. Does that mean the keys have to be in global memory?

I would like to use this code, but I want each thread to hold one key in a register and to keep the keys on-chip, in registers or shared memory, after they are sorted. Thanks in advance!

Recommended answer

You can do this using shared memory (which will keep it "on-chip"). I'm not sure how to do it using strictly registers without deconstructing the BlockRadixSort object.

Here's example code that uses shared memory to hold both the initial data to be sorted and the final sorted results. The sample is mostly set up for one data element per thread, since that seems to be what you are asking for. It's not difficult to extend it to multiple elements per thread, and I have put most of the plumbing in place to do that, with the exception of the data synthesis and debug printouts:

#include <cub/cub.cuh>
#include <stdio.h>
#define nTPB 32
#define ELEMS_PER_THREAD 1

// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER_THREAD integers)
__global__ void BlockSortKernel()
{
    __shared__ int my_val[nTPB*ELEMS_PER_THREAD];
    using namespace cub;
    // Specialize BlockRadixSort collective types
    typedef BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
    // Allocate shared memory for collectives
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // need to extend synthetic data for ELEMS_PER_THREAD > 1
    my_val[threadIdx.x*ELEMS_PER_THREAD]  = (threadIdx.x + 5)%nTPB; // synth data
    __syncthreads();
    printf("thread %d data = %d\n", threadIdx.x,  my_val[threadIdx.x*ELEMS_PER_THREAD]);

    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ELEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ELEMS_PER_THREAD))));
    __syncthreads();

    printf("thread %d sorted data = %d\n", threadIdx.x,  my_val[threadIdx.x*ELEMS_PER_THREAD]);
}

int main(){
    BlockSortKernel<<<1,nTPB>>>();
    cudaDeviceSynchronize();

}
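
The kernel above only synthesizes data for ELEMS_PER_THREAD equal to 1. As a rough host-side sketch of how the data synthesis might be extended to a blocked arrangement with several elements per thread, consider the following. The function name synth_blocked and the use of a std::vector in place of the shared array are assumptions made for illustration; they are not part of the original answer or of CUB.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side model of the kernel's shared buffer (not CUB code).
// Fills n_threads * elems_per_thread slots in a blocked arrangement, where
// "thread" t owns the consecutive slots [t * E, (t + 1) * E).
std::vector<int> synth_blocked(int n_threads, int elems_per_thread)
{
    assert(n_threads > 0 && elems_per_thread > 0);
    const int total = n_threads * elems_per_thread;
    std::vector<int> vals(total);
    for (int t = 0; t < n_threads; ++t) {             // plays the role of threadIdx.x
        for (int i = 0; i < elems_per_thread; ++i) {  // item index within one thread
            const int slot = t * elems_per_thread + i;
            vals[slot] = (slot + 5) % total;          // same rotation idea as the kernel
        }
    }
    return vals;
}
```

With elems_per_thread equal to 1 this reproduces the kernel's (threadIdx.x + 5) % nTPB pattern. In the kernel itself, each thread would presumably run only the inner loop, with t replaced by threadIdx.x and vals replaced by my_val.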

This seems to work correctly for me. In this case I happened to be using RHEL 5.5/gcc 4.1.2, CUDA 6.0 RC, and CUB v1.2.0 (which is quite recent).

The strange/ugly static cast is needed, as far as I can tell, because the CUB Sort expects a reference to an array whose length equals the customization parameter ITEMS_PER_THREAD (i.e. ELEMS_PER_THREAD):

   __device__ __forceinline__ void Sort(
        Key     (&keys)[ITEMS_PER_THREAD],          
        int     begin_bit   = 0,                   
        int     end_bit     = sizeof(Key) * 8)      
   { ...
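
To see the array-reference requirement in isolation, here is a small host-only sketch. The names sort_items, as_array and sort_chunks_of_4 are hypothetical stand-ins invented for this illustration. Unlike CUB's Sort, which sorts collectively across the whole thread block, sort_items simply sorts one chunk with std::sort. Only the parameter type and the pointer-to-array-reference cast mirror the code above, and the cast relies on the same assumption as the kernel: that the ITEMS consecutive ints really exist at that address.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in with a CUB-style parameter: a reference to an array
// whose length is a compile-time constant. It sorts only this one chunk.
template <typename Key, int ITEMS_PER_THREAD>
void sort_items(Key (&keys)[ITEMS_PER_THREAD])
{
    std::sort(keys, keys + ITEMS_PER_THREAD);
}

// View ITEMS consecutive ints starting at base + offset as an array reference,
// using the same static_cast-through-void* trick as the kernel.
template <int ITEMS>
int (&as_array(int *base, std::size_t offset))[ITEMS]
{
    assert(base != nullptr);
    return *static_cast<int (*)[ITEMS]>(static_cast<void *>(base + offset));
}

// Sort two adjacent 4-element chunks of a flat 8-element buffer in place.
inline void sort_chunks_of_4(int *buf)
{
    sort_items(as_array<4>(buf, 0));  // deduces Key = int, ITEMS_PER_THREAD = 4
    sort_items(as_array<4>(buf, 4));
}
```

Passing a plain int* to sort_items would not compile, because the array length could not be deduced. That is the reason the kernel goes through the cast rather than handing Sort the address my_val + threadIdx.x*ELEMS_PER_THREAD directly.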
