Using a structure as a buffer holder

Views: 128 Published: 2020/5/20 19:04:04 Tags: c++ c opencl


Problem description

In my current OpenCL implementation, I wanted to save time on kernel arguments: avoid passing buffers every time I want to use one inside a kernel, and keep each kernel's argument list short.

So I made a structure (a workspace) that holds pointers to the buffers in device memory. The struct acts like an object with member variables that you want to access over time and keep alive for the whole execution. I never had a problem on AMD GPUs or even on CPU, but Nvidia causes a lot of problems with this. It always seems to be an alignment problem, never reaching the right buffer, etc.

Here is some code to help; see the question below:

Structure defined on the host:

 #define SRC_IMG 0       // (float4 buffer) Source image
 #define LAB_IMG 1       // (float4 buffer) LAB image

 // NOTE: The size of this array should be as much as the last define + 1.
 #define __WRKSPC_SIZE__ 2 

 // Structure defined on host.
 struct Workspace
 {
      cl_ulong getPtr[__WRKSPC_SIZE__];
 };

 struct HostWorkspace
 {
      cl::Buffer srcImg;
      cl::Buffer labImg;
 };
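
The note above about keeping `__WRKSPC_SIZE__` in sync with the last define could be made automatic. What follows is only a minimal sketch, assuming the slot indices are moved into an enum whose last entry counts the slots. `WorkspaceSlot`, `WRKSPC_COUNT`, `WorkspaceSketch` and `workspace_set` are hypothetical names, and `uint64_t` stands in for `cl_ulong` so that the snippet builds without the OpenCL headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical replacement for the SRC_IMG / LAB_IMG defines. */
enum WorkspaceSlot {
    SRC_IMG = 0,   /* (float4 buffer) Source image */
    LAB_IMG,       /* (float4 buffer) LAB image */
    WRKSPC_COUNT   /* must stay last: equals the number of slots */
};

/* uint64_t stands in for cl_ulong (an assumption of this sketch). */
struct WorkspaceSketch {
    uint64_t getPtr[WRKSPC_COUNT];
};

/* Store an address in a slot, rejecting out-of-range slot indices. */
static void workspace_set(struct WorkspaceSketch *w,
                          enum WorkspaceSlot slot,
                          uint64_t address)
{
    assert((unsigned)slot < (unsigned)WRKSPC_COUNT);
    w->getPtr[slot] = address;
}
```

Adding a new image then means adding one enum entry before `WRKSPC_COUNT`; the array size follows without a second edit.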

Structure defined on the device:

typedef struct __attribute__(( packed )) gpuWorkspace
{
    ulong getPtr[__WRKSPC_SIZE__]; 
} gpuWorkspace_t;

Note that on the device I use ulong and on the host I use cl_ulong, as shown in "OpenCL: using struct as kernel argument".

Once the cl::Buffer objects for the source image and the LAB image are created, I save them into a HostWorkspace object. Until that object is released, the references to the cl::Buffer objects are kept, so the buffers exist for the entire run on the host and, de facto, on the device.

Now I need to feed those to the device, so I have a simple kernel that initializes my device workspace as follows:

__kernel void Workspace_Init(__global gpuWorkspace_t* wrkspc,
                             __global float4* src,
                             __global float4* LAB)
{
    // Store the address of the first element of each buffer as a ulong.
    // The explicit cast makes the pointer-to-integer conversion visible.
    wrkspc->getPtr[SRC_IMG] = (ulong)&src[0];
    wrkspc->getPtr[LAB_IMG] = (ulong)&LAB[0];
}

where wrkspc is a buffer allocated with struct Workspace, and src and LAB are just buffers allocated as 1D array images.

Afterwards, in any of my kernels, if I want to use src or LAB, I do as follows:

__kernel void ComputeLABFromSrc(__global gpuWorkspace_t* wrkSpc)
{
    // =============================================================
    // Get pointer from work space.
    // =============================================================

    // Cast back the pointer of first element as a normal buffer you
    // want to use along the execution of the kernel.
    __global float4* srcData = ( __global float4* )( wrkSpc->getPtr[SRC_IMG] );
    __global float4* labData = ( __global float4* )( wrkSpc->getPtr[LAB_IMG] );

    // Code kernel as usual.
}
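
The store-as-integer and cast-back pattern used by the two kernels above can be illustrated in plain host C. This is only a sketch of the cast mechanics, not OpenCL device code: `pointer_to_slot` and `slot_to_pointer` are hypothetical helper names, `float` stands in for `float4`, and `uint64_t` stands in for `ulong`. It says nothing about whether a device address remains valid from one kernel launch to the next.

```c
#include <assert.h>
#include <stdint.h>

/* Store a pointer in a 64-bit slot, mirroring the ulong slots of the workspace. */
static uint64_t pointer_to_slot(const float *p)
{
    return (uint64_t)(uintptr_t)p;
}

/* Recover the pointer that was stored in the slot. */
static float *slot_to_pointer(uint64_t slot)
{
    assert(slot <= UINTPTR_MAX);  /* the value must fit in a host pointer */
    return (float *)(uintptr_t)slot;
}
```

For example, after `uint64_t s = pointer_to_slot(data);` the call `slot_to_pointer(s)` gives back a pointer that compares equal to `data`, as long as the slot is at least as wide as a pointer.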

When I started to use this, I had about 4-5 images and it went well, with a different structure like this:

struct Workspace
{
    cl_ulong imgPtr;
    cl_ulong labPtr;
};

where each image had its own pointer.

At a certain point I reached more images and started having problems. I searched online and found a recommendation saying that sizeof() of the struct could differ between device and host, so I changed it to a single array of one type, and this worked fine up to 16 elements.

So I searched more and found a recommendation about __attribute__((packed)), which I put on the device structure (see above). But now I have reached 26 elements. When I check sizeof of the struct on either the device or the host, the size is 208 (elements * sizeof(cl_ulong) == 26 * 8). Still, I have an issue similar to the one with my previous model: my pointer ends up reading somewhere else, in the middle of the previous image, etc.
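
The size arithmetic above can be checked on the host with a few lines of C. This is a sketch under stated assumptions: `WRKSPC_SLOTS`, `WorkspaceCheck`, `expected_workspace_size` and `slot_offset` are hypothetical names, and `uint64_t` stands in for `cl_ulong`. A matching size and matching offsets only confirm the layout of the struct; they do not show that the stored addresses themselves are valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WRKSPC_SLOTS 26  /* hypothetical name; 26 matches the count in the text */

/* uint64_t stands in for cl_ulong (an assumption of this sketch). */
struct WorkspaceCheck {
    uint64_t getPtr[WRKSPC_SLOTS];
};

/* Expected size when there is no padding: one 8-byte slot per entry. */
static size_t expected_workspace_size(void)
{
    return WRKSPC_SLOTS * sizeof(uint64_t);
}

/* Byte offset of slot i inside the struct. */
static size_t slot_offset(size_t i)
{
    assert(i < WRKSPC_SLOTS);
    return offsetof(struct WorkspaceCheck, getPtr) + i * sizeof(uint64_t);
}
```

Comparing `sizeof(struct WorkspaceCheck)` with `expected_workspace_size()` and printing a few `slot_offset` values on both sides would be one way to rule out a padding mismatch between host and device.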

So I am wondering whether anyone has tried a similar model (maybe with a different approach) or has any tips for making this model "solid".

Note that all kernels are correctly coded: I get good results when executing the same code on AMD or on CPU. The only issue is on Nvidia.
