Memory allocation on GPU for dynamic array of structs


Question

I have a problem with passing an array of structs to a GPU kernel. I based my code on this topic - cudaMemcpy segmentation fault - and I wrote something like this:

#include <stdio.h>
#include <stdlib.h>

struct Test {
    char *array;
};

__global__ void kernel(Test *dev_test) {
    for(int i=0; i < 5; i++) {
        printf("Kernel[0][i]: %c 
", dev_test[0].array[i]);
    }
}

int main(void) {

    int n = 4, size = 5;
    Test *dev_test, *test;

    test = (Test*)malloc(sizeof(Test)*n);
    for(int i = 0; i < n; i++)
        test[i].array = (char*)malloc(size * sizeof(char));

    for(int i=0; i < n; i++) {
        char temp[] = { 'a', 'b', 'c', 'd' , 'e' };
        memcpy(test[i].array, temp, size * sizeof(char));
    }

    cudaMalloc((void**)&dev_test, n * sizeof(Test));
    cudaMemcpy(dev_test, test, n * sizeof(Test), cudaMemcpyHostToDevice);
    for(int i=0; i < n; i++) {
        cudaMalloc((void**)&(test[i].array), size * sizeof(char));
        cudaMemcpy(&(dev_test[i].array), &(test[i].array), size * sizeof(char), cudaMemcpyHostToDevice);
    }

    kernel<<<1, 1>>>(dev_test);
    cudaDeviceSynchronize();

    //  memory free
    return 0;
}

There is no error, but the values displayed in the kernel are incorrect. What am I doing wrong? Thanks in advance for any help.

Answer

  1. This is allocating a new pointer to host memory:

 test[i].array = (char*)malloc(size * sizeof(char));

  2. This is copying data to that region in host memory:

     memcpy(test[i].array, temp, size * sizeof(char));
    

  3. This is overwriting the previously allocated pointer to host memory (from step 1 above) with a new pointer to device memory:

     cudaMalloc((void**)&(test[i].array), size * sizeof(char));
    

  4. After step 3, the data you set up in step 2 is entirely lost and no longer accessible in any fashion. Referring to steps 3 and 4 in the question/answer you linked:

    3. Create a separate int pointer on the host, let's call it myhostptr

    4. cudaMalloc int storage on the device for myhostptr

    You haven't done this. You did not create a separate pointer. You reused (erased, overwrote) an existing pointer that was pointing to data you cared about on the host. This question/answer, also linked from the answer you linked, gives almost exactly the steps you need to follow, in code.
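
    In isolation, that pattern looks roughly like this (an illustrative sketch only; d_array is a hypothetical name, not something from your code):

     char *d_array;                              // step 3: a separate host-side pointer variable
     cudaMalloc(&d_array, size * sizeof(char));  // step 4: device storage for it
     // step 5: copy the device pointer value into the device-side struct member
     cudaMemcpy(&(dev_test[i].array), &d_array, sizeof(char *), cudaMemcpyHostToDevice);
     // then copy the actual data into that device allocation
     cudaMemcpy(d_array, test[i].array, size * sizeof(char), cudaMemcpyHostToDevice);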

    Here's a modified version of your code, which properly implements the missing steps 3, 4, and 5 that you did not implement correctly according to the question/answer you linked (refer to the comments delineating steps 3, 4, and 5):

    $ cat t755.cu
    #include <stdio.h>
    #include <stdlib.h>
    
    struct Test {
        char *array;
    };
    
    __global__ void kernel(Test *dev_test) {
        for(int i=0; i < 5; i++) {
            printf("Kernel[0][i]: %c 
    ", dev_test[0].array[i]);
        }
    }
    
    int main(void) {
    
        int n = 4, size = 5;
        Test *dev_test, *test;
    
        test = (Test*)malloc(sizeof(Test)*n);
        for(int i = 0; i < n; i++)
            test[i].array = (char*)malloc(size * sizeof(char));
    
        for(int i=0; i < n; i++) {
            char temp[] = { 'a', 'b', 'c', 'd' , 'e' };
            memcpy(test[i].array, temp, size * sizeof(char));
        }
    
        cudaMalloc((void**)&dev_test, n * sizeof(Test));
        cudaMemcpy(dev_test, test, n * sizeof(Test), cudaMemcpyHostToDevice);
    
        // Step 3:
        char *temp_data[n];
        // Step 4:
        for (int i=0; i < n; i++)
          cudaMalloc(&(temp_data[i]), size*sizeof(char));
        // Step 5:
        for (int i=0; i < n; i++)
          cudaMemcpy(&(dev_test[i].array), &(temp_data[i]), sizeof(char *), cudaMemcpyHostToDevice);
        // now copy the embedded data:
        for (int i=0; i < n; i++)
          cudaMemcpy(temp_data[i], test[i].array, size*sizeof(char), cudaMemcpyHostToDevice);
    
        kernel<<<1, 1>>>(dev_test);
        cudaDeviceSynchronize();
    
        //  memory free
        return 0;
    }
    
    $ nvcc -o t755 t755.cu
    $ cuda-memcheck ./t755
    ========= CUDA-MEMCHECK
    Kernel[0][i]: a
    Kernel[0][i]: b
    Kernel[0][i]: c
    Kernel[0][i]: d
    Kernel[0][i]: e
    ========= ERROR SUMMARY: 0 errors
    $
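
    The // memory free comment above is left unimplemented; a minimal cleanup sketch for this version (assuming the same variable names as in t755.cu) would be:

    // device side: free each embedded allocation, then the array of structs
    for (int i = 0; i < n; i++)
      cudaFree(temp_data[i]);
    cudaFree(dev_test);
    // host side: free each embedded allocation, then the array of structs
    for (int i = 0; i < n; i++)
      free(test[i].array);
    free(test);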
    

    Since the above methodology can be challenging for beginners, the usual advice is not to do it, but instead to flatten your data structures. Flattening generally means rearranging the data storage so as to remove the embedded pointers that have to be separately allocated.

    A trivial example of flattening this data structure would be to use this instead:

    struct Test {
        char array[5];
    };
    

    It's recognized of course that this particular approach would not serve every purpose, but it should illustrate the general idea/intent. With that modification, as an example, the code becomes much simpler:

    $ cat t755.cu
    #include <stdio.h>
    #include <stdlib.h>
    
    struct Test {
        char array[5];
    };
    
    __global__ void kernel(Test *dev_test) {
        for(int i=0; i < 5; i++) {
            printf("Kernel[0][i]: %c 
    ", dev_test[0].array[i]);
        }
    }
    
    int main(void) {
    
        int n = 4, size = 5;
        Test *dev_test, *test;
    
        test = (Test*)malloc(sizeof(Test)*n);
    
        for(int i=0; i < n; i++) {
            char temp[] = { 'a', 'b', 'c', 'd' , 'e' };
            memcpy(test[i].array, temp, size * sizeof(char));
        }
    
        cudaMalloc((void**)&dev_test, n * sizeof(Test));
        cudaMemcpy(dev_test, test, n * sizeof(Test), cudaMemcpyHostToDevice);
    
        kernel<<<1, 1>>>(dev_test);
        cudaDeviceSynchronize();
    
        //  memory free
        return 0;
    }
    $ nvcc -o t755 t755.cu
    $ cuda-memcheck ./t755
    ========= CUDA-MEMCHECK
    Kernel[0][i]: a
    Kernel[0][i]: b
    Kernel[0][i]: c
    Kernel[0][i]: d
    Kernel[0][i]: e
    ========= ERROR SUMMARY: 0 errors
    $
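
    The fixed-size char array[5] only works when every row has the same small, known length. One common way to keep a flattened layout for variable-length rows is a single packed buffer plus an array of offsets; the sketch below uses hypothetical names (data, offsets, dev_data, dev_offsets) and is not part of the original answer:

    // All rows live back to back in one buffer; offsets[i] marks where row i starts,
    // so row i has offsets[i+1] - offsets[i] characters and no embedded pointers remain.
    char data[20]   = { 'a','b','c','d','e',  'a','b','c','d','e',
                        'a','b','c','d','e',  'a','b','c','d','e' };
    int  offsets[5] = { 0, 5, 10, 15, 20 };   // n+1 entries (equal lengths here for brevity)
    char *dev_data;
    int  *dev_offsets;
    cudaMalloc(&dev_data, sizeof(data));
    cudaMalloc(&dev_offsets, sizeof(offsets));
    cudaMemcpy(dev_data, data, sizeof(data), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_offsets, offsets, sizeof(offsets), cudaMemcpyHostToDevice);
    // In the kernel, element j of row i is dev_data[dev_offsets[i] + j].

    Two cudaMemcpy calls move everything, and the kernel needs only index arithmetic rather than chasing per-row pointers.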
    
