Copying global on-device pointer address back and forth between device and host


Question

I created a buffer on the OpenCL device (a GPU), and from the host I need to know its global on-device pointer address. I want to store that address in a second buffer, so that the kernel can read the address from the second buffer and then access the contents of the first.

If that's confusing, here's what I'm trying to do. I create a generic buffer of floats representing a 2D image. Then, from the host, I build a todo list of everything my kernel needs to draw: which lines, which circles, which images, and so on. From that list the kernel has to know where to find each image. The reference to an image cannot be passed as a kernel argument, because the kernel might draw no image or a thousand different images, depending on what the list says. So the image has to be referenced from within the buffer that serves as the kernel's todo list.

To do so, I wrote a function that is called after the image buffer is created. It runs a small kernel that takes the buffer and writes its global on-device address as a ulong into another buffer; the host then reads that value back into a 64-bit integer, like this:

uint64_t get_clmem_device_address(clctx_t *clctx, cl_mem buf)
{
    const char kernel_source[] =
"kernel void get_global_ptr_address(global void *ptr, global ulong *devaddr)        \n"
"{                                          \n"
"   *devaddr = (ulong) ptr;                             \n"
"}                                          \n";

    int32_t i;
    cl_int ret;
    static int init=1;
    static cl_program program;
    static cl_kernel kernel;
    size_t global_work_size[1];
    static cl_mem ret_buffer;
    uint64_t devaddr;

    if (init)
    {
        init=0;
        ret = build_cl_program(clctx, &program, kernel_source);
        ret = create_cl_kernel(clctx, program, &kernel, "get_global_ptr_address");
        ret_buffer = clCreateBuffer(clctx->context, CL_MEM_WRITE_ONLY, 1*sizeof(uint64_t), NULL, &ret);
    }
    if (kernel==NULL)
        return 0;       // the function returns uint64_t, so a bare return is invalid

    // Run the kernel
    ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), &ret_buffer);

    global_work_size[0] = 1;
    ret = clEnqueueNDRangeKernel(clctx->command_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);         // enqueue the kernel
    ret = clEnqueueReadBuffer(clctx->command_queue, ret_buffer, CL_FALSE, 0, 1*sizeof(uint64_t), &devaddr, 0, NULL, NULL);      // copy the value
    ret = clFlush(clctx->command_queue);
    clFinish(clctx->command_queue);

    return devaddr;
}

Apparently this works (it does return a number, although it's hard to know whether it's correct). I then put this devaddr (a 64-bit integer on the host) into the todo list buffer that the kernel uses to know what to do. When the list calls for it, the kernel calls the function below, where le is a pointer to the relevant entry in the todo list and the 64-bit address is the entry's first element:

float4 blit_sprite(global uint *le, float4 pv)
{
    const int2 p = (int2) (get_global_id(0), get_global_id(1));
    ulong devaddr;
    global float4 *im;
    int2 im_dim;

    devaddr = ((global ulong *) le)[0];     // global address for the start of the image as a ulong
    im_dim.x = le[2];
    im_dim.y = le[3];

    im = (global float4 *) devaddr;     // ulong is turned into a proper global pointer

    if (p.x < im_dim.x)
        if (p.y < im_dim.y)
            pv += im[p.y * im_dim.x + p.x];     // this gives me a CL_OUT_OF_RESOURCES error, even when changing it to im[0]

    return pv;
}

But, big surprise, this doesn't work: it gives me a CL_OUT_OF_RESOURCES error, which I assume means my im pointer isn't valid. Actually, it does work: it failed only when I used two different contexts. But it's still pretty unwieldy.
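For reference, here is a minimal host-side sketch of how an entry with the layout that blit_sprite reads could be filled in: the 64-bit address in words 0 and 1, the width in word 2, the height in word 3. The function names and SPRITE_ENTRY_WORDS are hypothetical, not part of my actual code, and the sketch assumes the host and the device use the same byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry layout, matching what blit_sprite reads:
   words 0-1: 64-bit device address, word 2: width, word 3: height. */
enum { SPRITE_ENTRY_WORDS = 4 };

/* Fill one todo-list entry on the host. */
void pack_sprite_entry(uint32_t *le, uint64_t devaddr, uint32_t width, uint32_t height)
{
    assert(le != NULL);
    memcpy(&le[0], &devaddr, sizeof devaddr);   /* occupies le[0] and le[1] */
    le[2] = width;
    le[3] = height;
}

/* Read the address back from the same bytes that the kernel's ulong load would see. */
uint64_t unpack_sprite_address(const uint32_t *le)
{
    uint64_t devaddr;

    assert(le != NULL);
    memcpy(&devaddr, &le[0], sizeof devaddr);
    return devaddr;
}
```

Using memcpy avoids casting a uint32_t pointer to uint64_t on the host. On the device side, the entry should start at an 8-byte-aligned offset within the list buffer, since the kernel loads the address through a global ulong pointer.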

Is there a less awkward way to do what I want?

Recommended answer

The OpenCL standard doesn't guarantee that memory objects won't be physically reallocated between kernel calls. So a raw device-side address is only valid within a single kernel NDRange. That's one of the reasons why OpenCL memory objects are represented on the host side as opaque handles (cl_mem) rather than as device pointers.

You can, however, compute an offset relative to the memory object's first byte in the first kernel and pass that offset to the second kernel. Every time you launch a kernel, it obtains the current device-side address of the memory object from its argument and increments it by the saved offset. That would be perfectly "legal".
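The offset idea can be sketched in plain C that emulates the kernel-side lookup on the host. This is only an illustration under assumptions that go beyond the question: all images are packed into one pool buffer, each list entry stores the image's offset into that pool in float4 elements, and the names float4_t, pack_sprite_entry_offset and fetch_sprite_pixel are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for OpenCL's float4 so the sketch compiles as plain C. */
typedef struct { float x, y, z, w; } float4_t;

/* Hypothetical entry layout: word 0 = offset into the pool (in float4
   elements), word 1 = width, word 2 = height. */
void pack_sprite_entry_offset(uint32_t *le, uint32_t offset, uint32_t width, uint32_t height)
{
    assert(le != NULL);
    le[0] = offset;
    le[1] = width;
    le[2] = height;
}

/* Host-side emulation of the kernel lookup: the base pointer is supplied
   fresh on every launch, and only the offset comes from the list. */
float4_t fetch_sprite_pixel(const float4_t *pool, const uint32_t *le, uint32_t px, uint32_t py)
{
    const float4_t zero = {0.0f, 0.0f, 0.0f, 0.0f};
    const float4_t *im;

    assert(pool != NULL && le != NULL);
    im = pool + le[0];                      /* base + saved offset */
    if (px >= le[1] || py >= le[2])
        return zero;                        /* outside the image */
    return im[(size_t)py * le[1] + px];
}
```

In an actual OpenCL kernel the pool would be a global float4 pointer argument set with clSetKernelArg, and the lookup would be the same pointer arithmetic. No raw device address ever has to leave the kernel, so the list stays valid even if the runtime moves the buffer between launches.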
