CPU-GPU Parallel Programming (Python)


Problem description


Is there a way to run functions concurrently on the CPU and GPU (using Python)? I'm already using Numba to do thread-level scheduling for compute-intensive functions on the GPU, but I now also need parallelism between the CPU and GPU. Once the GPU's memory holds all the data needed to start processing, I need to trigger the GPU kernel launch and then, in parallel, run some functions on the host using the CPU.


I'm sure that the time the GPU takes to return its results is much longer than the time the CPU needs to finish its task. That way, once the GPU has finished processing, the CPU is already waiting to fetch the data back to the host. Is there a standard library/way to achieve this? Appreciate any pointers in this regard.

Answer


Thanks Robert and Ander. I was thinking along similar lines but wasn't sure. I verified that, up until I add some synchronization for task completion between the two (e.g. cp.cuda.Device().synchronize() when using CuPy), I'm effectively running the GPU and CPU in parallel, since kernel launches return to the host immediately. Thanks again. A general flow to make gpu_function and cpu_function run in parallel looks something like the following:

    """ GPU has buffer full to start processing Frame N-1 """
    tmp_gpu = cp.asarray(tmp_cpu)
    gpu_function(tmp_gpu)
    """ CPU receives Frame N over TCP socket """
    tmp_cpu = cpu_function()
    """ For instance we know cpu_function takes [a little] longer than gpu_function """
     cp.cuda.Device().synchronize()
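
The overlap above relies on CuPy kernel launches returning control to the host immediately. For readers without a GPU, the same launch-then-synchronize pattern can be sketched with a background thread standing in for the device; gpu_function and cpu_function here are CPU-only placeholders, not the poster's real code:

```python
import threading
import time

def gpu_function(buf):
    # Stand-in for asynchronous device work: sleep to mimic a kernel.
    time.sleep(0.2)
    buf["done"] = True

def cpu_function():
    # Stand-in for host-side work that runs while the "device" is busy.
    time.sleep(0.1)
    return "frame N"

buf = {"done": False}
worker = threading.Thread(target=gpu_function, args=(buf,))

start = time.perf_counter()
worker.start()            # "launch" the device work asynchronously
frame = cpu_function()    # host work runs concurrently with the worker
worker.join()             # analogous to cp.cuda.Device().synchronize()
elapsed = time.perf_counter() - start

# Because the two overlap, total time is close to the longer task (~0.2 s),
# not the sum of both (~0.3 s).
print(buf["done"], frame)
```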


Of course, we could even hide the time spent transferring tmp_cpu to tmp_gpu by employing ping-pong buffers and an initial one-frame delay.
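
To make the ping-pong idea concrete, here is a minimal CPU-only sketch: two host buffers alternate roles each frame, so while the "device" drains one buffer the host fills the other. The helpers receive_frame and process_on_device are hypothetical placeholders for the TCP receive and the GPU kernel:

```python
def receive_frame(n):
    # Pretend frame n arrived over a TCP socket.
    return [n] * 4

def process_on_device(buf):
    # Pretend this is the GPU kernel consuming a buffer.
    return sum(buf)

buffers = [None, None]           # the "ping" and "pong" buffers
buffers[0] = receive_frame(0)    # initial one-frame delay fills buffer 0

results = []
for n in range(1, 5):
    fill, drain = n % 2, (n - 1) % 2
    # In a real pipeline the next two lines run concurrently:
    buffers[fill] = receive_frame(n)                    # host fills one buffer...
    results.append(process_on_device(buffers[drain]))   # ...device drains the other

print(results)  # [0, 4, 8, 12]
```

Because the buffer being filled is never the one being drained, the transfer of frame N can fully overlap with the arrival of frame N+1, at the cost of one frame of latency.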
