Reading updated memory from other CUDA stream


Problem description

I am trying to set a flag in one kernel function and read it in another. Basically, I'm trying to do the following.

#include <iostream>                                                              
#include <cuda.h>                                                                
#include <cuda_runtime.h>                                                        

#define FLAGCLEAR 0                                                              
#define FLAGSET   1                                                              

using namespace std;                                                             

__global__ void set_flag(int *flag)                                              
{                                                                                
    *flag = FLAGSET;                                                             

    // Wait for flag to reset.                                                   
    while (*flag == FLAGSET);                                                    
}                                                                                

__global__ void read_flag(int *flag)                                             
{                                                                                
    // wait for the flag to set.                                                 
    while (*flag != FLAGSET);                                                    

    // Clear it for next time.                                                   
    *flag = FLAGCLEAR;                                                           
}                                                                                

int main(void)                                                                   
{                                                                                
    // Setup memory for flag                                                     
    int *flag;                                                                   
    cudaMalloc(&flag, sizeof(int));                                              

    // Setup streams                                                             
    cudaStream_t stream0, stream1;                                               
    cudaStreamCreate(&stream0);                                                  
    cudaStreamCreate(&stream1);                                                  

    // Print something to let me know that we started.                           
    cout << "Starting the flagging" << endl;                                     

    // do the flag test                                                          
    set_flag  <<<1,1,0,stream0>>>(flag);                                         
    read_flag <<<1,1,0,stream1>>>(flag);                                         

    // Wait for the streams                                                      
    cudaDeviceSynchronize();                                                     

    // Getting here is a painful process!
    cout << "Finished the flagging" << endl;                                     

    // Clean UP!                                                                 
    cudaStreamDestroy(stream0);                                                  
    cudaStreamDestroy(stream1);                                                  
    cudaFree(flag);                                                              

}

I eventually get the second printout, but only after the computer freezes for 15 seconds, and I get both printouts at the same time. These streams are supposed to run in parallel, and not bog the system down. What am I doing wrong? How can I fix this?

Thanks.

EDIT

It seems as though a special case has been solved by adding volatile, but now something else has broken. If I add anything between the two kernel calls, the system reverts to the old behavior, namely freezing and then printing everything at once. Adding sleep(2); between set_flag and read_flag reproduces this. Also, when put in another program, this causes the GPU to lock up. What am I doing wrong now?

Thanks again.

Solution

The compiler is allowed to do fairly aggressive optimization. Furthermore, the L1 caches on Fermi devices are not guaranteed to be coherent. To work around these issues, try adding the volatile keyword to your functions' usage of the flag variable, like so:

__global__ void set_flag(volatile int *flag)       

and

__global__ void read_flag(volatile int *flag)     

Generally speaking, when volatile is applied to a variable resident in global memory, the compiler issues loads that bypass the L1 cache, and it will generally not optimize accesses to that variable into registers, so each pass through the polling loop re-reads the flag from memory.

I think you'll have better results.
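For reference, here is a minimal sketch of the whole program with that change applied. Two things in it are my additions rather than part of your code: the cudaMemset that puts the flag in a known state before either kernel starts, and the check of the value returned by cudaDeviceSynchronize. It still assumes both kernels are actually resident on the GPU at the same time; if the second launch is held back, set_flag spins forever.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define FLAGCLEAR 0
#define FLAGSET   1

// volatile makes every loop iteration re-read *flag from memory
// instead of reusing a value cached in a register or in L1.
__global__ void set_flag(volatile int *flag)
{
    *flag = FLAGSET;

    // Wait for the other kernel to clear the flag.
    while (*flag == FLAGSET) { }
}

__global__ void read_flag(volatile int *flag)
{
    // Wait for the other kernel to set the flag.
    while (*flag != FLAGSET) { }

    // Clear it for next time.
    *flag = FLAGCLEAR;
}

int main(void)
{
    int *flag = nullptr;
    cudaMalloc(&flag, sizeof(int));

    // Addition (not in the original code): start from a known state,
    // since cudaMalloc does not initialize the memory it returns.
    cudaMemset(flag, FLAGCLEAR, sizeof(int));

    cudaStream_t stream0, stream1;
    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);

    printf("Starting the flagging\n");
    fflush(stdout);

    set_flag <<<1, 1, 0, stream0>>>(flag);
    read_flag<<<1, 1, 0, stream1>>>(flag);

    // Addition: report the error code instead of ignoring it.
    cudaError_t err = cudaDeviceSynchronize();
    printf("Finished the flagging: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(stream0);
    cudaStreamDestroy(stream1);
    cudaFree(flag);

    return err == cudaSuccess ? 0 : 1;
}
```

Build it with something like nvcc flag.cu -o flag. The handshake only completes when the two kernels overlap in time, which CUDA does not guarantee for kernels in different streams.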

The code you've posted has the potential to deadlock because of these issues. Therefore, what you're observing may actually be the OS (e.g. Windows TDR) interrupting your program.
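One way to check whether that is what is happening is to ask the runtime whether the device can run kernels concurrently, whether a run-time limit (watchdog) applies to kernels on it, and what error the synchronize call returns. The sketch below uses cudaDeviceGetAttribute and cudaDeviceSynchronize for that. The helper describe_sync_result is a hypothetical name I made up for illustration, and treating cudaErrorLaunchTimeout as the watchdog case is what I would typically expect, not something I have verified on your setup.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turn the result of a synchronize call into a hint.
const char *describe_sync_result(cudaError_t err)
{
    if (err == cudaSuccess)
        return "kernels finished normally";
    if (err == cudaErrorLaunchTimeout)
        return "kernel was terminated by the run-time limit (e.g. Windows TDR)";
    return cudaGetErrorString(err);
}

int main(void)
{
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) {
        printf("No usable CUDA device\n");
        return 1;
    }

    // 1 if the device can execute multiple kernels at the same time.
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentKernels, device);

    // 1 if there is a run-time limit on kernels executed on this device.
    int watchdog = 0;
    cudaDeviceGetAttribute(&watchdog, cudaDevAttrKernelExecTimeout, device);

    printf("concurrent kernels supported: %d\n", concurrent);
    printf("kernel run-time limit active: %d\n", watchdog);

    // Launch the kernels under test here, then inspect the result.
    cudaError_t err = cudaDeviceSynchronize();
    printf("synchronize: %s\n", describe_sync_result(err));

    return 0;
}
```

If the run-time limit is active and the synchronize call reports a launch timeout, the freeze followed by both printouts arriving together is consistent with the watchdog killing a spinning kernel rather than with the handshake completing.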
