Using multiple GPUs OpenCL

Problem Description

I have a loop within which I am launching multiple kernels onto a GPU. Below is the snippet:

for (int idx = start; idx <= end; idx++) {

            ret = clEnqueueNDRangeKernel(command_queue, memset_kernel, 1, NULL,
                                            &global_item_size_memset, &local_item_size, 0, NULL, NULL);
            ASSERT_CL(ret, "Error after launching 1st memset_kernel !");


            ret = clEnqueueNDRangeKernel(command_queue, cholesky_kernel, 1, NULL,
                                                    &global_item_size_cholesky, &local_item_size, 0, NULL, NULL);
            ASSERT_CL(ret, "Error after launching 1st cholesky_kernel !");


            ret = clEnqueueNDRangeKernel(command_queue, ckf_kernel1, 1, NULL,
                                            &global_item_size_kernel1, &local_item_size, 0, NULL,  NULL);
            ASSERT_CL(ret, "Error after launching ckf_kernel1[i] !");



            clFinish(command_queue);
            ret = clEnqueueNDRangeKernel(command_queue, memset_kernel, 1, NULL,
                                            &global_item_size_memset, &local_item_size, 0, NULL, NULL);
            ASSERT_CL(ret, "Error after launching 2nd memset_kernel !");


            ret = clEnqueueNDRangeKernel(command_queue, cholesky_kernel, 1, NULL,
                                                    &global_item_size_cholesky, &local_item_size, 0, NULL, NULL);
            ASSERT_CL(ret, "Error after launching 2nd cholesky_kernel !");


            ret = clSetKernelArg(ckf_kernel2, 4, sizeof(idx), (void *)&idx);

            ret = clEnqueueNDRangeKernel(command_queue, ckf_kernel2, 1, NULL,
                                            &global_item_size_kernel2, &local_item_size, 0, NULL, NULL);
            ASSERT_CL(ret, "Error after launching ckf_kernel2 !");
}

Now I want to use this code on a system which has multiple GPUs. So I have completed the following steps:

  • created a single context for all the GPUs.
  • created one command queue per device.
  • created separate kernels for each device (code snippet below assuming two gpus)
  • allocated separate device buffers for each device

cl_kernel ckf_kernel1[2];
cl_kernel ckf_kernel2[2];
cl_kernel cholesky_kernel[2];
cl_kernel memset_kernel[2];

// Create the kernels for device 0.
ckf_kernel1[0] = clCreateKernel(program, "ckf_kernel1", &ret);
ASSERT_CL(ret, "Cannot load ckf_kernel1[i]!");
ckf_kernel2[0] = clCreateKernel(program, "ckf_kernel2", &ret);
ASSERT_CL(ret, "Cannot load ckf_kernel2!");
memset_kernel[0] = clCreateKernel(program, "memset_zero", &ret);
ASSERT_CL(ret, "Cannot load memset_kernel!");
cholesky_kernel[0] = clCreateKernel(program, "cholesky_kernel", &ret);
ASSERT_CL(ret, "Cannot load cholesky_kernel!");

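// Create the kernels for device 1.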
ckf_kernel1[1] = clCreateKernel(program, "ckf_kernel1", &ret);
ASSERT_CL(ret, "Cannot load ckf_kernel1[i]!");
ckf_kernel2[1] = clCreateKernel(program, "ckf_kernel2", &ret);
ASSERT_CL(ret, "Cannot load ckf_kernel2!");
memset_kernel[1] = clCreateKernel(program, "memset_zero", &ret);
ASSERT_CL(ret, "Cannot load memset_kernel!");
cholesky_kernel[1] = clCreateKernel(program, "cholesky_kernel", &ret);
ASSERT_CL(ret, "Cannot load cholesky_kernel!");

Now, I am not sure how to launch the kernels onto the different devices within the loop. How do I get them to execute in parallel? Please note that there is a clFinish command within the loop above.

另一个问题:在主机上使用多个线程/进程(每个线程/进程负责在单个GPU上启动内核)是否是标准做法?

Another question: is it standard practice to use multiple threads/processes on the host where each thread/process is responsible for launching kernels on a single GPU?

Recommended Answer

  1. You do not need to create separate contexts for all your devices. You only need to do that if they come from different platforms.
  2. You also do not need to create separate kernels. You can compile your kernels for multiple devices at the same time (clBuildProgram supports multi-device compilation), and when you launch a kernel on a device, the runtime knows whether the kernel object holds a device binary valid for that device.
  3. Simplest way: create one context, fetch all the devices you need and place them in an array, build your kernels with that array, and create one command_queue for every device in it (see the sketch after this list).
  4. clEnqueueNDRangeKernel is non-blocking. The only reason your for loop does not race through is the clFinish() call, and most likely because you are using an in-order queue, which means the single-device case would work fine without clFinish too.
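
Below is a minimal sketch of points 1-3 in C. It assumes both GPUs sit on the same platform; the function name, the placeholder kernel source and the "noop" kernel are illustrative, not taken from the question:

#include <CL/cl.h>

// Placeholder kernel source; in the question this would be the real program.
static const char *source_str = "__kernel void noop(void) {}";

void setup_two_gpus(void) {
    cl_int ret;
    cl_platform_id platform;
    ret = clGetPlatformIDs(1, &platform, NULL);

    // Fetch up to two GPU devices from the same platform.
    cl_device_id devices[2];
    cl_uint num_devices = 0;
    ret = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, devices, &num_devices);

    // One context shared by all devices of this platform (point 1).
    cl_context context = clCreateContext(NULL, num_devices, devices, NULL, NULL, &ret);

    // Build the program once for every device in the array (point 2).
    cl_program program = clCreateProgramWithSource(context, 1, &source_str, NULL, &ret);
    ret = clBuildProgram(program, num_devices, devices, NULL, NULL, NULL);

    // A single kernel object is enough; the runtime selects the right
    // device binary based on the queue the kernel is enqueued on.
    cl_kernel kernel = clCreateKernel(program, "noop", &ret);

    // One command queue per device (point 3).
    cl_command_queue queues[2];
    for (cl_uint d = 0; d < num_devices; ++d)
        queues[d] = clCreateCommandQueue(context, devices[d], 0, &ret);
}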

The general idea for best usage of multi-GPU in OpenCL is to create the context, kernels and queues the way I mentioned, and make the queues out-of-order. That way commands are allowed to execute in parallel if they don't have unmet dependencies; e.g., if the input of command2 is not the output of command1, then command2 is free to start executing in parallel to command1. If you are using this method, however, you HAVE to use the final few parameters of clEnqueueNDRangeKernel, because you have to build this chain of dependencies using cl_events. Every clEnqueueWhatever can wait on an array of events which originate from other commands. Execution of a command in the queue will only start once all its dependencies are met.
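
For illustration, here is a hedged sketch of such an event chain for the first two kernels of the loop, reusing the question's kernel and size variables; the queue, device and event names (ooo_queue, device, ev_memset, ev_cholesky) are mine:

// Create the queue with the out-of-order property.
cl_command_queue ooo_queue = clCreateCommandQueue(
        context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &ret);

cl_event ev_memset, ev_cholesky;

// memset_kernel has no dependencies: empty wait list, but it emits an event.
ret = clEnqueueNDRangeKernel(ooo_queue, memset_kernel, 1, NULL,
                             &global_item_size_memset, &local_item_size,
                             0, NULL, &ev_memset);

// cholesky_kernel consumes memset_kernel's output, so it waits on ev_memset.
ret = clEnqueueNDRangeKernel(ooo_queue, cholesky_kernel, 1, NULL,
                             &global_item_size_cholesky, &local_item_size,
                             1, &ev_memset, &ev_cholesky);

// Wait on the tail of the chain instead of clFinish on the whole queue.
ret = clWaitForEvents(1, &ev_cholesky);
clReleaseEvent(ev_memset);
clReleaseEvent(ev_cholesky);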

There is one issue that you have not touched upon, and that is the idea of buffers. If you want to get multi-GPU running, you need to explicitly create buffers for your devices separately and partition your data. It is not valid to have the same buffer set as an argument on 2 devices while both of them are trying to write it. At best, the runtime will serialize your work, and the 2 devices will not work in parallel. This is because buffers are handles to memory, and the runtime is responsible for moving the contents of the buffer to the devices that need it. (This can happen implicitly (lazy memory movement), or explicitly if you call clEnqueueMigrateMemObjects.) The runtime is forbidden to give the same buffer with CL_MEM_READ_WRITE or CL_MEM_WRITE_ONLY flags to 2 devices simultaneously. Even though you as the programmer may know that the 2 devices are not writing the same part of the buffer, the runtime does not. You have to tell it. The elegant way is to create 2 sub-buffers that are part of the larger/original buffer; the less elegant way is simply to create 2 buffers. The first approach is better because it is easier to collect data from multiple devices back to the host: you need to fetch only the large buffer, and the runtime will know which sub-buffers have been modified on which devices and will take care of collecting the data.
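
A sketch of that sub-buffer approach, splitting one parent buffer across the two devices; the buffer size, the argument index and host_result (assumed pre-allocated host memory of total_bytes) are placeholders of mine:

// Example size; in the real code this comes from the problem dimensions.
const size_t total_bytes = 1 << 20;
const size_t half = total_bytes / 2;

cl_mem big = clCreateBuffer(context, CL_MEM_READ_WRITE, total_bytes, NULL, &ret);

// Each half of the parent buffer becomes its own sub-buffer.
// Note: each region origin must respect CL_DEVICE_MEM_BASE_ADDR_ALIGN.
cl_buffer_region region0 = { 0,    half };
cl_buffer_region region1 = { half, half };
cl_mem sub0 = clCreateSubBuffer(big, CL_MEM_READ_WRITE,
                                CL_BUFFER_CREATE_TYPE_REGION, &region0, &ret);
cl_mem sub1 = clCreateSubBuffer(big, CL_MEM_READ_WRITE,
                                CL_BUFFER_CREATE_TYPE_REGION, &region1, &ret);

// Each device's kernel only ever sees its own half.
ret = clSetKernelArg(ckf_kernel1[0], 0, sizeof(cl_mem), &sub0);
ret = clSetKernelArg(ckf_kernel1[1], 0, sizeof(cl_mem), &sub1);

// After both devices finish, one read of the parent buffer gathers
// both halves back to the host.
ret = clEnqueueReadBuffer(queues[0], big, CL_TRUE, 0, total_bytes,
                          host_result, 0, NULL, NULL);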

If I saw your clSetKernelArg calls and the buffers you are using, I could see what the dependencies are between your kernels and write out what you need to do, but I think this is a fairly good start for you in getting multi-device running. Ultimately, it's all about the data. (And start using out-of-order queues, because they have the potential to be faster, and they force you to start using events, which make it explicit to you and anyone reading the code which kernels are allowed to run in parallel.)
