CUDA unified memory between GPU and host


Question

I'm writing a cuda-based program that needs to periodically transfer a set of items from the GPU to the Host memory. In order to keep the process asynchronous, I was hoping to use cuda's UMA to have a memory buffer and flag in the host memory (so both the GPU and the CPU can access it). The GPU would make sure the flag is clear, add its items to the buffer, and set the flag. The CPU waits for the flag to be set, copies things out of the buffer, and clears the flag. As far as I can see, this doesn't produce any race condition because it forces the GPU and CPU to take turns, always reading and writing to the flag opposite each other.

So far I haven't been able to get this to work because there does seem to be some sort of race condition. I came up with a simpler example that has a similar issue:

#include <stdio.h>

__global__
void uva_counting_test(int n, int *h_i);

int main() {
    int *h_i;
    int n;

    cudaMallocHost(&h_i, sizeof(int));

    *h_i = 0;
    n = 2;

    uva_counting_test<<<1, 1>>>(n, h_i);

    //even numbers
    for(int i = 1; i <= n; ++i) {
        //wait for a change to odd from gpu
        while(*h_i == (2*(i - 1)));

        printf("host h_i: %d\n", *h_i);
        *h_i = 2*i;
    }

    return 0;
}

__global__
void uva_counting_test(int n, int *h_i) {
    //odd numbers
    for(int i = 0; i < n; ++i) {
        //wait for a change to even from host
        while(*h_i == (2*(i - 1) + 1));

        *h_i = 2*i + 1;
    }
}

For me, this case always hangs after the first print statement from the CPU (host h_i: 1). The really unusual thing (which may be a clue) is that I can get it to work in cuda-gdb. If I run it in cuda-gdb, it will hang as before. If I press ctrl+C, it will bring me to the while() loop line in the kernel. From there, surprisingly, I can tell it to continue and it will finish. For n > 2, it will freeze on the while() loop in the kernel again after each kernel, but I can keep pushing it forward with ctrl+C and continue.

Answer

You are describing a producer-consumer model, where the GPU is producing some data and, from time to time, the CPU will consume that data.

The simplest way to implement this is to have the CPU be the master. The CPU launches a kernel on the GPU, and when it is ready to consume data (i.e. the while loop in your example) it synchronises with the GPU, copies the data back from the GPU, launches the kernel again to generate more data, and does whatever it has to do with the data it copied. This allows you to have the GPU filling a fixed-size buffer while the CPU is processing the previous batch (since there are two copies, one on the GPU and one on the CPU).
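Below is a minimal sketch of that CPU-as-master pattern. The produce_items() kernel, buffer size, and batch count are hypothetical stand-ins for whatever your real producer does, and error checking is omitted for brevity.

#include <cstdio>
#include <cuda_runtime.h>

#define BUF_ITEMS 1024
#define BATCHES 10

// Hypothetical producer: each launch fills one batch into d_buf.
__global__ void produce_items(int *d_buf, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) d_buf[idx] = idx;  // stand-in for real work
}

int main() {
    int *d_buf, *h_buf;
    cudaMalloc(&d_buf, BUF_ITEMS * sizeof(int));
    cudaMallocHost(&h_buf, BUF_ITEMS * sizeof(int));  // pinned host copy of the buffer

    produce_items<<<(BUF_ITEMS + 255) / 256, 256>>>(d_buf, BUF_ITEMS);  // first batch
    for (int b = 0; b < BATCHES; ++b) {
        // cudaMemcpy waits for the outstanding kernel, then copies its output.
        cudaMemcpy(h_buf, d_buf, BUF_ITEMS * sizeof(int), cudaMemcpyDeviceToHost);
        if (b + 1 < BATCHES)  // the GPU starts producing the next batch...
            produce_items<<<(BUF_ITEMS + 255) / 256, 256>>>(d_buf, BUF_ITEMS);
        printf("batch %d: first item = %d\n", b, h_buf[0]);  // ...while the CPU consumes this one
    }

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}

Because kernel launches are asynchronous, the CPU is free to work on h_buf while the GPU refills d_buf; the synchronous cudaMemcpy is the only point where the two sides wait for each other.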

That can be improved upon by double-buffering the data, meaning that you can keep the GPU busy producing data 100% of the time by ping-ponging between buffers as you copy the other to the CPU. That assumes the copy-back is faster than the production, but if not then you will saturate the copy bandwidth, which is also good.
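Here is a rough sketch of that ping-pong scheme, reusing the same hypothetical produce_items() kernel: two device buffers, two pinned host buffers, and two streams, so the copy-back of one batch can overlap production of the next.

#include <cstdio>
#include <cuda_runtime.h>

#define N 1024
#define BATCHES 10

__global__ void produce_items(int *d_buf, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) d_buf[idx] = idx;  // stand-in for real production
}

int main() {
    int *d_buf[2], *h_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&d_buf[i], N * sizeof(int));
        cudaMallocHost(&h_buf[i], N * sizeof(int));  // pinned memory is required for async copies
        cudaStreamCreate(&stream[i]);
    }

    for (int b = 0; b < BATCHES; ++b) {
        int cur = b & 1;  // ping-pong buffer/stream index
        // Enqueue production and copy-back for this batch in its own stream.
        produce_items<<<(N + 255) / 256, 256, 0, stream[cur]>>>(d_buf[cur], N);
        cudaMemcpyAsync(h_buf[cur], d_buf[cur], N * sizeof(int),
                        cudaMemcpyDeviceToHost, stream[cur]);
        if (b > 0) {
            int prev = (b - 1) & 1;
            cudaStreamSynchronize(stream[prev]);  // the previous batch is now on the host
            printf("batch %d: first item = %d\n", b - 1, h_buf[prev][0]);
        }
    }
    cudaStreamSynchronize(stream[(BATCHES - 1) & 1]);  // drain and consume the final batch
    printf("batch %d: first item = %d\n", BATCHES - 1, h_buf[(BATCHES - 1) & 1][0]);

    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(h_buf[i]);
        cudaFree(d_buf[i]);
    }
    return 0;
}

Note that a buffer is only consumed, and only reused for a later batch, after cudaStreamSynchronize on its stream, which is what keeps the ping-pong free of data hazards.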

Neither of those is what you actually described. What you asked for is to have the GPU master the data. I'd urge caution on that since you will need to manage your buffer size carefully and you will need to think carefully about the timings and communication issues. It's certainly possible to do something like that, but before you explore that direction you should read up on memory fences, atomic operations, and volatile.
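For reference, here is a sketch of how the toy example from the question might be made to behave using those ingredients: volatile on both sides, so neither compiler caches the flag in a register, and __threadfence_system(), so that in a real buffer-plus-flag protocol the buffer writes become visible to the host before the flag flips. The waits are restated to spin until the expected value appears. Treat this as an illustration under the assumption of a 64-bit, UVA-capable platform, not a guaranteed-portable synchronisation scheme.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void uva_counting_test(int n, volatile int *h_i) {
    //odd numbers
    for (int i = 0; i < n; ++i) {
        while (*h_i != 2 * i);    // wait for the host to advance to the next even value
        __threadfence_system();   // in a real protocol: flush buffer writes before setting the flag
        *h_i = 2 * i + 1;
    }
}

int main() {
    volatile int *h_i;
    int n = 2;

    cudaMallocHost((void **)&h_i, sizeof(int));
    *h_i = 0;

    uva_counting_test<<<1, 1>>>(n, h_i);

    //even numbers
    for (int i = 1; i <= n; ++i) {
        while (*h_i != 2 * i - 1);  // wait for the GPU to write the next odd value
        printf("host h_i: %d\n", *h_i);
        *h_i = 2 * i;
    }

    cudaDeviceSynchronize();
    cudaFreeHost((void *)h_i);
    return 0;
}

Even with these fixes, a spinning host loop can still hang on some platforms (for example, if launch batching defers the kernel submission), which is part of why the CPU-as-master designs above are the safer starting point.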
