CUDA: how to create 2D texture object?

Question

I'm trying to create a 2D texture object holding 4x4 uint8_t values. Here is the code:

__global__ void kernel(cudaTextureObject_t tex)
{
    int x = threadIdx.x;
    int y = threadIdx.y;
    uint8_t val = tex2D<uint8_t>(tex, x, y);
    printf("%d, ", val);
    return;
}

int main(int argc, char **argv)
{
    cudaTextureObject_t tex;
    uint8_t dataIn[16] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
    uint8_t* dataDev = 0;
    cudaMalloc((void**)&dataDev, 16);
    struct cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr = dataDev;
    resDesc.res.pitch2D.desc.x = 8;
    resDesc.res.pitch2D.desc.y = 8;
    resDesc.res.pitch2D.desc.f = cudaChannelFormatKindUnsigned;
    resDesc.res.pitch2D.width = 4;
    resDesc.res.pitch2D.height = 4;
    resDesc.res.pitch2D.pitchInBytes = 4;
    struct cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    cudaMemcpy(dataDev, &dataIn[0], 16, cudaMemcpyHostToDevice);
    dim3 threads(4, 4);
    kernel<<<1, threads>>>(tex);
    cudaDeviceSynchronize();
    return 0;
}

I expected the result to be:

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,

i.e. all values of the texture object (order doesn't matter).

But the actual result is:

0, 2, 4, 6, 0, 2, 4, 6, 0, 2, 4, 6, 0, 2, 4, 6,     

What am I doing wrong?

Answer

When you use the pitch2D resource type for a texture, the underlying allocation is supposed to be a proper pitched allocation, which I think people would typically create with cudaMallocPitch. The stated requirement, however, is:

cudaResourceDesc::res::pitch2D::pitchInBytes specifies the pitch between two rows in bytes and has to be aligned to cudaDeviceProp::texturePitchAlignment.

On my GPU, that property is 32. I don't know which GPU you have, but I bet the property is not 4 there. Yet you are specifying 4 here:

resDesc.res.pitch2D.pitchInBytes = 4;
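
To make the alignment rule concrete, here is a minimal host-side sketch of the arithmetic involved. The helper names isAlignedPitch and roundUpPitch are hypothetical and not part of the CUDA API, and the alignment would normally be read from cudaDeviceProp::texturePitchAlignment rather than hard-coded.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUDA API call): true if `pitchInBytes` is a
// nonzero multiple of `alignment`, which is what the quoted requirement asks.
bool isAlignedPitch(std::size_t pitchInBytes, std::size_t alignment)
{
    assert(alignment > 0);
    return pitchInBytes != 0 && pitchInBytes % alignment == 0;
}

// Hypothetical helper: smallest multiple of `alignment` that can hold a row
// of `rowBytes` bytes.
std::size_t roundUpPitch(std::size_t rowBytes, std::size_t alignment)
{
    assert(alignment > 0);
    return (rowBytes + alignment - 1) / alignment * alignment;
}
```

With an alignment of 32, isAlignedPitch(4, 32) is false, and roundUpPitch(4, 32) gives 32 as the smallest acceptable row stride for a 4-byte row. The pitch returned by cudaMallocPitch may be larger than this minimum.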

Again, I think people would typically use a pitched allocation created with cudaMallocPitch for this. However, it does appear to be possible to pass an ordinary linear allocation, provided the row-to-row stride in bytes is divisible by texturePitchAlignment (32 in my case).

Another change I made is to use cudaCreateChannelDesc<>() instead of setting the fields manually as you did. This produces a different set of desc values, which also seems to affect the result. The differences should not be difficult to study.
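
One difference can be read off the descriptor fields directly: the x, y, z and w fields are per-component bit counts, so together they determine how many bytes one texel occupies. The following is a minimal host-side sketch of that calculation; ChannelBits and texelBytes are hypothetical names used only for illustration and are not CUDA types.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the x/y/z/w bit-count fields of
// cudaChannelFormatDesc (the format kind is left out of this sketch).
struct ChannelBits { int x, y, z, w; };

// Bytes occupied by one texel: the component bit counts summed, divided by 8.
std::size_t texelBytes(ChannelBits d)
{
    const int bits = d.x + d.y + d.z + d.w;
    assert(bits > 0 && bits % 8 == 0);
    return static_cast<std::size_t>(bits) / 8;
}
```

With {8, 0, 0, 0}, which is what cudaCreateChannelDesc<unsigned char>() produces, a texel is one byte. With x = 8 and y = 8, as set manually in the question, a texel is two bytes, which is at least consistent with the observed output advancing by 2 per texel.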

When I adjust your code to address those issues, I get results that seem sensible to me:

$ cat t30.cu
#include <stdio.h>
#include <stdint.h>
#include <string.h>  // memset

typedef uint8_t mt;  // use an integer type

__global__ void kernel(cudaTextureObject_t tex)
{
    int x = threadIdx.x;
    int y = threadIdx.y;
    mt val = tex2D<mt>(tex, x, y);
    printf("%d, ", val);
}

int main(int argc, char **argv)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("texturePitchAlignment: %zu\n", prop.texturePitchAlignment);  // size_t needs %zu
    cudaTextureObject_t tex;
    const int num_rows = 4;
    const int num_cols = prop.texturePitchAlignment*1; // should be able to use a different multiplier here
    const int ts = num_cols*num_rows;
    const int ds = ts*sizeof(mt);
    mt dataIn[ds];
    for (int i = 0; i < ts; i++) dataIn[i] = i;
    mt* dataDev = 0;
    cudaMalloc((void**)&dataDev, ds);
    cudaMemcpy(dataDev, dataIn, ds, cudaMemcpyHostToDevice);
    struct cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr = dataDev;
    resDesc.res.pitch2D.width = num_cols;
    resDesc.res.pitch2D.height = num_rows;
    resDesc.res.pitch2D.desc = cudaCreateChannelDesc<mt>();
    resDesc.res.pitch2D.pitchInBytes = num_cols*sizeof(mt);
    struct cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    dim3 threads(4, 4);
    kernel<<<1, threads>>>(tex);
    cudaDeviceSynchronize();
    printf("\n");
    return 0;
}
$ nvcc -o t30 t30.cu
$ cuda-memcheck ./t30
========= CUDA-MEMCHECK
texturePitchAlignment: 32
0, 1, 2, 3, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, 99,
========= ERROR SUMMARY: 0 errors
$
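
The output jumps from 3 to 32 because the texture in this version is 32 texels wide and the kernel only reads the first four columns of each of the four rows. A minimal host-side model of that addressing is sketched below; makeRamp, fetchTexel and readBlock4x4 are hypothetical names, and the model assumes 1-byte texels with unnormalized coordinates and no filtering, as in the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Ramp buffer with buf[i] = i (mod 256), mirroring how dataIn is filled above.
std::vector<std::uint8_t> makeRamp(std::size_t n)
{
    std::vector<std::uint8_t> buf(n);
    for (std::size_t i = 0; i < n; ++i) buf[i] = static_cast<std::uint8_t>(i);
    return buf;
}

// Host-side model of fetching a 1-byte texel from a pitched buffer:
// texel (x, y) is the byte at offset y * pitchInBytes + x.
std::uint8_t fetchTexel(const std::vector<std::uint8_t>& buf,
                        std::size_t pitchInBytes, std::size_t x, std::size_t y)
{
    const std::size_t offset = y * pitchInBytes + x;
    assert(offset < buf.size());
    return buf[offset];
}

// Values read by a 4x4 block of threads, listed row by row.
std::vector<int> readBlock4x4(const std::vector<std::uint8_t>& buf,
                              std::size_t pitchInBytes)
{
    std::vector<int> out;
    for (std::size_t y = 0; y < 4; ++y)
        for (std::size_t x = 0; x < 4; ++x)
            out.push_back(fetchTexel(buf, pitchInBytes, x, y));
    return out;
}
```

Calling readBlock4x4(makeRamp(128), 32) yields the same sixteen values as the run above. On the GPU the order in which the threads print is not guaranteed.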

As asked in the comments, if I were going to do something similar to this but using cudaMallocPitch and cudaMemcpy2D, it could look something like this:

$ cat t1421.cu
#include <stdio.h>
#include <stdint.h>
#include <string.h>  // memset

typedef uint8_t mt;  // use an integer type

__global__ void kernel(cudaTextureObject_t tex)
{
    int x = threadIdx.x;
    int y = threadIdx.y;
    mt val = tex2D<mt>(tex, x, y);
    printf("%d, ", val);
}

int main(int argc, char **argv)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("texturePitchAlignment: %zu\n", prop.texturePitchAlignment);  // size_t needs %zu
    cudaTextureObject_t tex;
    const int num_rows = 4;
    const int num_cols = prop.texturePitchAlignment*1; // should be able to use a different multiplier here
    const int ts = num_cols*num_rows;
    const int ds = ts*sizeof(mt);
    mt dataIn[ds];
    for (int i = 0; i < ts; i++) dataIn[i] = i;
    mt* dataDev = 0;
    size_t pitch;
    cudaMallocPitch((void**)&dataDev, &pitch,  num_cols*sizeof(mt), num_rows);
    cudaMemcpy2D(dataDev, pitch, dataIn, num_cols*sizeof(mt), num_cols*sizeof(mt), num_rows, cudaMemcpyHostToDevice);
    struct cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr = dataDev;
    resDesc.res.pitch2D.width = num_cols;
    resDesc.res.pitch2D.height = num_rows;
    resDesc.res.pitch2D.desc = cudaCreateChannelDesc<mt>();
    resDesc.res.pitch2D.pitchInBytes = pitch;
    struct cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    dim3 threads(4, 4);
    kernel<<<1, threads>>>(tex);
    cudaDeviceSynchronize();
    printf("\n");
    return 0;
}
$ nvcc -o t1421 t1421.cu
$ cuda-memcheck ./t1421
========= CUDA-MEMCHECK
texturePitchAlignment: 32
0, 1, 2, 3, 32, 33, 34, 35, 64, 65, 66, 67, 96, 97, 98, 99,
========= ERROR SUMMARY: 0 errors
$
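
The point of the cudaMemcpy2D call is that the source rows are tightly packed while the destination rows are spaced pitch bytes apart. The sketch below models that row-by-row copy on the host only. padRows is a hypothetical name, and zero-filling the padding is an assumption made for readability; the padding bytes of a real pitched allocation are unspecified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-side model of a 2D copy from tightly packed rows into a pitched
// buffer: each row of `widthBytes` bytes starts `dstPitch` bytes after the
// previous one. Padding is zero-filled here purely for illustration.
std::vector<unsigned char> padRows(const std::vector<unsigned char>& packed,
                                   std::size_t widthBytes, std::size_t height,
                                   std::size_t dstPitch)
{
    assert(widthBytes <= dstPitch);
    assert(packed.size() >= widthBytes * height);
    std::vector<unsigned char> pitched(dstPitch * height, 0);
    for (std::size_t row = 0; row < height; ++row)
        std::memcpy(pitched.data() + row * dstPitch,   // row start in padded buffer
                    packed.data() + row * widthBytes,  // row start in packed buffer
                    widthBytes);
    return pitched;
}
```

For example, two packed 4-byte rows copied with a destination pitch of 8 end up as 0, 1, 2, 3 followed by four padding bytes, then 4, 5, 6, 7 followed by four more. The same layout is what would let a 4-texel-wide texture, like the one in the question, sit in a buffer whose pitch satisfies the alignment requirement.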

Although what we have here are texture objects, it's easy enough to demonstrate that similar issues occur with texture references. You cannot create an arbitrarily small 2D texture reference, just as you cannot create an arbitrarily small 2D texture object. I'm not going to provide a demonstration of that as well, since it would largely duplicate the above, and texture references should no longer be used for new development work; texture objects are the better approach.
