How to stop/cancel a cuda kernel launched by Numba before it ends?


Question

I have a simulation written in Python/Numba that uses several CUDA GPUs. Each one is launched from a different process using a separate CUDA context. The simulation runs a very long loop and, at the end, reports its result to the parent process, which stores the best result so far; the processes then keep going.

When one GPU/process finishes its kernel and reports a new best result, I would like to kill the kernel executions on the other processes/GPUs so they can pick up this new best result and iterate on it, instead of waiting for them to finish. Each execution can take 30 minutes, so if I can kill one that has just started and go again with better data, that saves me a lot of time.
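The host-side orchestration being described can be sketched without a GPU. In this hypothetical sketch, Python threads stand in for the per-GPU worker processes, and `simulate` stands in for the long-running kernel; none of these names come from the question's real code:

```python
import threading
import time

best = {"value": None}           # best result seen so far (held by the "parent")
lock = threading.Lock()
stop_event = threading.Event()   # the "kill the other kernels" signal

def simulate(steps, result):
    """Stand-in for a long kernel: poll the stop flag between chunks of work."""
    for _ in range(steps):
        if stop_event.is_set():
            return None          # abandoned early, like a stopped kernel
        time.sleep(0.001)        # a chunk of "work"
    return result

def worker(steps, result):
    res = simulate(steps, result)
    if res is not None:
        with lock:
            if best["value"] is None or res > best["value"]:
                best["value"] = res
        stop_event.set()         # first finisher tells everyone else to stop

# One fast worker (finishes first) and three slow ones that get stopped early.
threads = [threading.Thread(target=worker, args=(5, 42))]
threads += [threading.Thread(target=worker, args=(2000, r)) for r in (1, 2, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(best["value"])             # 42: only the fast worker completed
```

The hard part, and the subject of the answer below, is making the "poll the stop flag" step work inside a running CUDA kernel rather than a host thread.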

I can't seem to find a way to stop a launched CUDA kernel.

Can it be done?

I'm using Numba 0.51.

Answer

It's not possible to stop a running kernel in CUDA without either:

  1. assistance from the kernel code itself, or
  2. corrupting the CUDA context, which makes any subsequent CUDA operations fail.

Item 2 is not satisfactory, so to "asynchronously" stop a running kernel, the kernel code (all threads) must "poll" a memory location that gives an indication to stop.

A typical way to provide such a memory location is to use pinned/zero-copy techniques in CUDA. In Numba, this type of memory is allocated as mapped memory, which is accessible from both host and device at the same time. An additional wrinkle is that the device code must not cache any copies of the memory locations used for communication. The only method I found in Numba to accomplish this is to use atomics.

Here is a worked example combining these ideas:

$ cat t51.py
import numpy as np
import numba as nb

from numba import cuda

@cuda.jit
def test(arr):
    while nb.cuda.atomic.max(arr, 0, 0) < 1: #poll for signal to stop
        nb.cuda.atomic.add(arr, 1, 1)        #do "other work"
    arr[2] = 1                               #acknowledge stop signal

if __name__ == '__main__':

    arr = nb.cuda.mapped_array(3, dtype=np.int32)
    arr[0] = 0   # stop signal goes here
    arr[1] = 1   # monitoring "other work"
    arr[2] = 0   # acknowledgment of stop signal
    my_str = nb.cuda.stream()
    griddim = (1,1)
    blockdim = (1,1,1)
    test[griddim, blockdim, my_str](arr)   # launch work to be done
    for i in range(1000):  # for demo, give kernel time to start
        if arr[1] < 2:
            print(arr[1])
    print(arr[0])
    while arr[2] != 1:     # send stop signal, and wait for acknowledgment
        arr[0] = 1
    print(arr[0])          # for demo
    nb.cuda.synchronize()  # if stop is working correctly code will not hang here
    print(arr[0])          # for demo
    print(arr[1])
$ python t51.py
0
1
1
1600
$
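The control flow of that example can also be exercised without a GPU. The following is a minimal host-only analogue, an assumption of this write-up rather than part of the original answer, with a background thread playing the role of the kernel and the same three-slot array (stop signal, work counter, acknowledgment):

```python
import threading
import numpy as np

# [stop signal, work counter, acknowledgment] -- same layout as in the example
arr = np.zeros(3, dtype=np.int32)

def fake_kernel(arr):
    # Mirrors the device loop: do "other work" until the stop slot becomes
    # nonzero, then acknowledge. No atomics are needed here because each slot
    # is written by only one thread, and CPython's GIL serializes the accesses.
    while arr[0] < 1:
        arr[1] += 1          # "other work"
    arr[2] = 1               # acknowledge the stop signal

t = threading.Thread(target=fake_kernel, args=(arr,))
t.start()
while arr[1] < 100:          # let the "kernel" make some progress first
    pass
arr[0] = 1                   # send the stop signal
while arr[2] != 1:           # wait for acknowledgment
    pass
t.join()                     # if the handshake works, this does not hang
print(arr[0], arr[2])        # 1 1
```

On the GPU, the atomics in the real example serve the role the GIL plays here: they force every read and write of the shared slots to go to the actual memory location instead of a cached copy.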
