Creating a vector in cuda kernel


Problem description

I need to create some sort of stack to crawl a tree within a CUDA kernel. I thought I could use thrust::device_vector, but apparently not. Is there an API for this, or do I have to just code it myself?

__global__ 
void step_objects_kernel(ContainerNode* root, ObjectNode** objs, ObjectNode* new_objs, size_t n, real dt, real g)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= n) return;

    thrust::device_vector<Node*> to_visit;
    to_visit.push_back(root);

    vec3 a = {0};
    while(!to_visit.empty())
    {
        Node* n = to_visit.back();
        to_visit.pop_back();

    }
}
error: calling a __host__ function("thrust::device_vector<Node *, thrust::device_malloc_allocator<Node *> > ::device_vector") from a __global__ function("step_objects_kernel") is not allowed


Recommended answer

It is correct that thrust::device_vector is not usable in CUDA device code.

I'm not aware of any in-kernel container-like API that is part of the CUDA distribution itself. If you search around, however, you will probably find dozens of possibly useful/interesting implementations. A lower-level library like trove could possibly give improved performance for this kind of use case.

In your example, it appears that each thread will maintain its own "stack" or "vector" to keep track of tree traversal. (The method I will offer here depends on not having threads concurrently accessing the same stack. If you need concurrent access from several threads, the method here may be of interest as a starting point.)

If you know what the maximum probable size for such a stack would be, I would suggest allocating for it ahead of time, either as a static (local) variable definition per-thread in-kernel, or as a dynamic allocation, e.g. via cudaMalloc. (I would not suggest in-kernel malloc for this, and I definitely would not suggest allocating/deallocating on the fly, for performance reasons.) Which allocation method gives the most performance may depend on your actual test case. The coalescing rules (i.e. the underlying storage method) are somewhat different for access through a global pointer vs. access through a local pointer. If your threads tend to push and pop uniformly across a warp as your code progresses, then either allocation method should give good performance. You can experiment with either approach.

Here's a fairly simple, partially worked example of the "stack" methods you have outlined, assuming the maximum stack size per thread is known a priori. It's by no means fully tested; my purpose is to give you some ideas or a starting point. However, if you find errors, please feel free to point them out and I will try to address them.

$ cat t1082.cu

const size_t max_items = 256;


template <typename T>
class cu_st{  // simple implementation of "stack" function

  T *my_ptr;
  size_t n_items;
  size_t my_width;
  public:
    __host__ __device__
    cu_st(T *base, size_t id, size_t width=0){
      if (width == 0){ // "local" stack allocated
        my_ptr = base;
        my_width = 1;}
      else{            // "global" stack allocated
        my_ptr = base + id;
        my_width = width;}
      n_items = 0;}

    __host__ __device__
    int push_back(T &item){
      if (n_items < max_items){
        *my_ptr = item;
        my_ptr += my_width;
        n_items++;
        return 0;}
      return -1;}

    __host__ __device__
    T pop_back(){
      if (n_items > 0){
        n_items--;
        my_ptr -= my_width;}
      return *my_ptr;}

    __host__ __device__
    T back(){
      if (n_items > 0){
        return *(my_ptr-my_width);}
      return *my_ptr;}

    __host__ __device__
    bool empty(){
      return (n_items == 0);}

    __host__ __device__
    size_t size(){
      return n_items;}

    __host__ __device__
    size_t max_size(){
      return max_items;}

};

const size_t nTPB = 256;
const size_t nBLK = 256;

typedef int Node;

__global__
void kernel(Node **g_stack, size_t n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= n) return;
    Node *root = NULL;

    //method 1 - global stack
    cu_st<Node*> to_visit(g_stack, idx, gridDim.x*blockDim.x);
    to_visit.push_back(root);

    while(!to_visit.empty())
    {
        Node* n = to_visit.back();
        to_visit.pop_back();

    }

    //method 2 - local stack

    Node *l_stack[max_items];
    cu_st<Node*> l_to_visit(l_stack, idx);
    l_to_visit.push_back(root);

    while(!l_to_visit.empty())
    {
        Node* n = l_to_visit.back();
        l_to_visit.pop_back();

    }

}

int main(){

  Node **d_stack;
  cudaMalloc(&d_stack, nTPB*nBLK*max_items*sizeof(Node *));
  kernel<<<nBLK, nTPB>>>(d_stack, nTPB*nBLK);
  cudaDeviceSynchronize();
}
$ nvcc -o t1082 t1082.cu
$ cuda-memcheck ./t1082
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

Notes:


  1. Other than what you see here, this code is not tested. I'd suggest doing more verification before using it as-is.

  2. As you can see in the code, there is essentially no error checking.

  3. This sort of random access will generally tend to be slow, probably regardless of which allocation method you choose. If possible, minimize your use of such a "stack". If you know that the stack size per thread is very small, you could also try experimenting with this construct using a __shared__ memory allocation (a rough sketch of that variant follows these notes).

  4. Another allocation approach which I have not demonstrated here would be to give each thread a global allocation, but have the thread push and pop contiguously rather than in the strided fashion shown above (algorithmically a combination of the two methods outlined here). Such a method will definitely degrade performance in the "uniform" case, but may give better performance in some "random" access patterns. This variant also appears in the sketch below.
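
Below is a minimal, untested sketch of the two variants mentioned in notes 3 and 4: a per-thread stack carved out of __shared__ memory, and a contiguous (non-strided) per-thread slice of a global allocation. It reuses the cu_st class above unchanged; the kernel name kernel_variants, the variable names, and the constant small_max_items are assumptions made for this sketch. For the shared-memory variant to be safe, max_items in cu_st would need to be lowered to small_max_items (or the class templated on its capacity), since push_back only bounds-checks against max_items.

const size_t small_max_items = 16;   // assumed per-thread capacity small enough for shared memory

__global__
void kernel_variants(Node **g_stack, size_t n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= n) return;
    Node *root = NULL;

    // variant A (note 3) - per-thread stack in shared memory, strided by threadIdx.x;
    // per-block footprint: nTPB * small_max_items pointers (256*16*8 bytes = 32KB here)
    __shared__ Node *s_stack[nTPB * small_max_items];
    cu_st<Node*> s_to_visit(s_stack, threadIdx.x, blockDim.x);
    s_to_visit.push_back(root);

    // variant B (note 4) - contiguous per-thread slice of the global allocation;
    // width defaults to 1, so each thread pushes/pops within its own max_items-entry slice
    cu_st<Node*> c_to_visit(g_stack + (size_t)idx * max_items, idx);
    c_to_visit.push_back(root);

    while(!c_to_visit.empty())
    {
        Node* cur = c_to_visit.back();
        c_to_visit.pop_back();
        // ... visit cur, push children ...
    }
}

Launching it looks the same as in the example above, e.g. kernel_variants<<<nBLK, nTPB>>>(d_stack, nTPB*nBLK);, and it can reuse the same d_stack allocation, since each thread's contiguous slice still consumes max_items entries.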
