3D array writing and reading as texture in CUDA


Question

Due to the nature of the algorithm I am programming I need to write/fill a 3D matrix with some specific maths and then read from that matrix (in a separate kernel) as a 3D linearly interpolated texture.

As a texture is a read-only access mode, I am assuming I can somehow write to the global memory bound to the texture and, in a separate kernel, read from it, without needing double the memory and copying the values from the write matrix to the read matrix. However, I can't figure out how to do this.

  • How can 3D texture memory be used for reading and writing (in separate kernels)?

My problem is that I don't know how to define this global read/write array. In the sample below, I have created a 3D texture, but this uses code with cudaExtent and cudaArray. I don't seem to be able to use these types to write to them, nor do I seem to be able to create them with float* or the like.

I may not be able to do this and need a memcpy somewhere in the middle, but as these arrays are generally big, I'd like to save memory.

Sample code (it doesn't compile, but it clearly defines the structure of what I am trying to do). It uses a 100x100x100 3D volume as the default size.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cuda_runtime_api.h>
#include <cuda.h>

#define MAXTREADS 1024

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);
texture<float, cudaTextureType3D, cudaReadModeElementType> tex;

__global__ void readKernel(float* imageend )
{
    int indY = blockIdx.y * blockDim.y + threadIdx.y;
    int indX = blockIdx.x * blockDim.x + threadIdx.x;
    int indZ = blockIdx.z * blockDim.z + threadIdx.z;
    // Make sure we don't go out of bounds
    size_t idx = indZ * 100 * 100 + indY * 100 + indX;
    if (indX >= 100 || indY >= 100 || indZ >= 100)
        return;
    imageend[idx] = tex3D(tex, indX + 0.5, indY + 0.5, indZ + 0.5);

}
__global__ void writeKernel(float* imageaux){
    int indY = blockIdx.y * blockDim.y + threadIdx.y;
    int indX = blockIdx.x * blockDim.x + threadIdx.x;
    int indZ = blockIdx.z * blockDim.z + threadIdx.z;
    // Make sure we don't go out of bounds
    size_t idx = indZ * 100 * 100 + indY * 100 + indX;
    if (indX >= 100 || indY >= 100 || indZ >= 100)
        return;
    imageaux[idx] = (float)idx;

}
int main()
{

    cudaArray *d_image_aux= 0;
    const cudaExtent extent = make_cudaExtent(100, 100, 100);
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    cudaMalloc3DArray(&d_image_aux, &channelDesc, extent);

    // Configure texture options
    tex.normalized = false;
    tex.filterMode = cudaFilterModeLinear;
    tex.addressMode[0] = cudaAddressModeBorder;
    tex.addressMode[1] = cudaAddressModeBorder;
    tex.addressMode[2] = cudaAddressModeBorder;

    cudaBindTextureToArray(tex, d_image_aux, channelDesc);

    float *d_image_end = 0;
    size_t num_bytes = 100 * 100 * 100 * sizeof(float);
    cudaMalloc((void**)&d_image_end, num_bytes);
    cudaMemset(d_image_end, 0, num_bytes);

    int divx, divy, divz; //Irrelevant for the demo, important for the main code
    divx = 32;
    divy = 32;
    divz = 1;
    dim3 grid((100 + divx - 1) / divx,
        (100 + divy - 1) / divy,
        (100 + divz - 1) / divz);
    dim3 block(divx, divy, divz);

    // Kernels
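    // (Editorial note) This is where the sample breaks down: d_image_aux is a cudaArray*,
    // not a float*, so it cannot be passed to writeKernel or written directly by a kernel.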
    writeKernel << <grid, block >> >(d_image_aux);
    readKernel  << <grid, block >> >(d_image_end);


    cudaUnbindTexture(tex);
    cudaFreeArray(d_image_aux);
    cudaFree(d_image_end);

    return 0;
}

NOTE: I am aware that I can not write "interpolated" or whatever that would be. The write operation will always be in integer indexes, while the read operation needs to use trilinear interpolation.
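
(For reference: with unnormalized coordinates and cudaFilterModeLinear, texel centres sit at half-integer coordinates, which is why readKernel above samples at indX + 0.5 and so on. A read exactly at a texel centre returns the stored value unchanged, while any other coordinate blends the eight surrounding texels trilinearly, as described in the CUDA C Programming Guide.)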

Answer

I believe all of the necessary pieces to demonstrate a kernel writing to a 3D surface (bound to an underlying 3D cudaArray), followed by another kernel texturing (i.e. with auto interpolation) from the same data (a 3D texture bound to the same underlying 3D cudaArray) are contained in the volumeFiltering CUDA sample code.

The only conceptual difference is the sample code has 2 different underlying 3D cudaArrays (one for the texture, one for the surface) but we can combine these, so that the data written to the surface is subsequently read during the texturing operation.
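
The key step, which the full listing below spells out, is to allocate a single cudaArray with the cudaArraySurfaceLoadStore flag and then bind both the surface reference (used by the writing kernel) and the texture reference (used by the reading kernel) to that same array. Excerpted from the example that follows:

checkCudaErrors(cudaMalloc3DArray(&content, &channelDesc, vol, cudaArraySurfaceLoadStore));
...
checkCudaErrors(cudaBindSurfaceToArray(volumeTexOut, content));  // write through the surface
surf_write<<<gridSize, blockSize>>>(d_data, vol);
checkCudaErrors(cudaBindTextureToArray(volumeTexIn, content));   // then texture from the same array
tex_read<<<1,1>>>(x, y, z);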

Here is a fully worked example:

$ cat texsurf.cu
#include <stdio.h>
#include <helper_cuda.h>

texture<float, cudaTextureType3D, cudaReadModeElementType>  volumeTexIn;
surface<void,  3>                                    volumeTexOut;

__global__ void
surf_write(float *data,cudaExtent volumeSize)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    int z = blockIdx.z*blockDim.z + threadIdx.z;

    if (x >= volumeSize.width || y >= volumeSize.height || z >= volumeSize.depth)
    {
        return;
    }
    float output = data[z*(volumeSize.width*volumeSize.height)+y*(volumeSize.width)+x];
    // surface writes need byte offsets for x!
    surf3Dwrite(output,volumeTexOut,x * sizeof(float),y,z);

}

__global__ void
tex_read(float x, float y, float z){
    printf("x: %f, y: %f, z:%f, val: %f\n", x,y,z,tex3D(volumeTexIn,x,y,z));
}

void runtest(float *data, cudaExtent vol, float x, float y, float z)
{
    // create 3D array
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    cudaArray_t content;
    checkCudaErrors(cudaMalloc3DArray(&content, &channelDesc, vol, cudaArraySurfaceLoadStore));

    // copy data to device
    float *d_data;
    checkCudaErrors(cudaMalloc(&d_data, vol.width*vol.height*vol.depth*sizeof(float)));
    checkCudaErrors(cudaMemcpy(d_data, data, vol.width*vol.height*vol.depth*sizeof(float), cudaMemcpyHostToDevice));

    dim3 blockSize(8,8,8);
    dim3 gridSize((vol.width+7)/8,(vol.height+7)/8,(vol.depth+7)/8);
    volumeTexIn.filterMode     = cudaFilterModeLinear;
    checkCudaErrors(cudaBindSurfaceToArray(volumeTexOut,content));
    surf_write<<<gridSize, blockSize>>>(d_data, vol);
    // bind array to 3D texture
    checkCudaErrors(cudaBindTextureToArray(volumeTexIn, content));
    tex_read<<<1,1>>>(x, y, z);
    checkCudaErrors(cudaDeviceSynchronize());
    cudaFreeArray(content);
    cudaFree(d_data);
    return;
}

int main(){
   const int dim = 8;
   float *data = (float *)malloc(dim*dim*dim*sizeof(float));
   for (int z = 0; z < dim; z++)
     for (int y = 0; y < dim; y++)
       for (int x = 0; x < dim; x++)
         data[z*dim*dim+y*dim+x] = z*100+y*10+x;
   cudaExtent vol = {dim,dim,dim};
   runtest(data, vol, 1.5, 1.5, 1.5);
   runtest(data, vol, 1.6, 1.6, 1.6);
   return 0;
}


$ nvcc -I/usr/local/cuda/samples/common/inc texsurf.cu -o texsurf
$ cuda-memcheck ./texsurf
========= CUDA-MEMCHECK
x: 1.500000, y: 1.500000, z:1.500000, val: 111.000000
x: 1.600000, y: 1.600000, z:1.600000, val: 122.234375
========= ERROR SUMMARY: 0 errors
$

I'm not going to try to give a full tutorial on linear texture filtering here. There are plenty of other example questions here which cover the details of indexing and filtering, and it doesn't seem to be the crux of this question. I've chosen the points (1.5, 1.5, 1.5) and (1.6, 1.6, 1.6) for easy verification of the underlying data; the results make sense to me.
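
A quick check of those numbers (an editorial addition, not part of the original answer): with unnormalized coordinates and cudaFilterModeLinear, texel centres lie at half-integer coordinates, so the read at (1.5, 1.5, 1.5) hits texel (1,1,1) exactly and returns its stored value 1*100 + 1*10 + 1 = 111. The read at (1.6, 1.6, 1.6) sits 0.1 of a texel past those centres in each direction, so exact trilinear interpolation of this data would give 111 + 0.1*(100 + 10 + 1) = 122.1; the printed 122.234375 deviates slightly because the hardware computes the interpolation weights in 9-bit fixed point (8 fractional bits), as documented in the CUDA C Programming Guide.

One further editorial note: the texture<...> and surface<...> references used in both listings have been deprecated for some time and were removed in CUDA 12. Below is a minimal, untested sketch of the same write-through-a-surface / read-through-a-texture combination using texture and surface objects instead; the helper name makeObjects is illustrative only, everything else is the standard runtime API.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void surf_write_obj(cudaSurfaceObject_t surf, const float *data, cudaExtent volumeSize)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    int z = blockIdx.z*blockDim.z + threadIdx.z;
    if (x >= volumeSize.width || y >= volumeSize.height || z >= volumeSize.depth) return;
    float output = data[z*(volumeSize.width*volumeSize.height) + y*volumeSize.width + x];
    surf3Dwrite(output, surf, x*sizeof(float), y, z);   // x offset is in bytes, as before
}

__global__ void tex_read_obj(cudaTextureObject_t tex, float x, float y, float z)
{
    printf("val: %f\n", tex3D<float>(tex, x, y, z));    // trilinear fetch through the object
}

// Create a surface object (for writing) and a texture object (for filtered reads)
// that are both backed by the same cudaArray.
void makeObjects(cudaArray_t content, cudaSurfaceObject_t *surfObj, cudaTextureObject_t *texObj)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = content;          // one array backs both objects

    cudaTextureDesc texDesc = {};
    texDesc.filterMode = cudaFilterModeLinear;  // trilinear interpolation on reads
    texDesc.readMode = cudaReadModeElementType;
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.addressMode[2] = cudaAddressModeClamp;
    texDesc.normalizedCoords = 0;               // unnormalized (texel) coordinates

    cudaCreateSurfaceObject(surfObj, &resDesc);
    cudaCreateTextureObject(texObj, &resDesc, &texDesc, NULL);
}

The cudaArray itself is still allocated with cudaMalloc3DArray(..., cudaArraySurfaceLoadStore), exactly as in the answer's runtest().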
