will cudaMalloc synchronize host and device?
Problem Description
I understand that cudaMemcpy will synchronize host and device, but how about cudaMalloc or cudaFree?
Basically I want to overlap memory allocation/copies and kernel execution across multiple GPU devices, and a simplified version of my code looks like this:
void wrapper_kernel(const int &ngpu, const float * const &data)
{
    cudaSetDevice(ngpu);
    cudaMalloc(...);
    cudaMemcpyAsync(...);
    kernels<<<...>>>(...);
    cudaMemcpyAsync(...);
    // some host code
}

int main()
{
    const int NGPU=3;
    static float *data[NGPU];
    for (int i=0; i<NGPU; i++) wrapper_kernel(i,data[i]);
    cudaDeviceSynchronize();
    // some host code
}
However, the GPUs are running sequentially, and I can't find out why.
Recommended Answer
Try using a cudaStream_t for each GPU. Below is simpleMultiGPU.cu, taken from the CUDA samples.
//Solver config
TGPUplan plan[MAX_GPU_COUNT];
//GPU reduction results
float h_SumGPU[MAX_GPU_COUNT];

....memory init....

//Create streams for issuing GPU commands asynchronously and allocate memory (GPU and system page-locked)
for (i = 0; i < GPU_N; i++)
{
    checkCudaErrors(cudaSetDevice(i));
    checkCudaErrors(cudaStreamCreate(&plan[i].stream));
    //Allocate memory
    checkCudaErrors(cudaMalloc((void **)&plan[i].d_Data, plan[i].dataN * sizeof(float)));
    checkCudaErrors(cudaMalloc((void **)&plan[i].d_Sum, ACCUM_N * sizeof(float)));
    checkCudaErrors(cudaMallocHost((void **)&plan[i].h_Sum_from_device, ACCUM_N * sizeof(float)));
    checkCudaErrors(cudaMallocHost((void **)&plan[i].h_Data, plan[i].dataN * sizeof(float)));
    for (j = 0; j < plan[i].dataN; j++)
    {
        plan[i].h_Data[j] = (float)rand() / (float)RAND_MAX;
    }
}
....kernel, memory copyback....
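The issuing loop elided above follows the same pattern in the sample: all the (blocking) cudaMalloc/cudaMallocHost calls were made up front, so the per-GPU loop queues only asynchronous calls on each plan's stream. Roughly, as a sketch based on simpleMultiGPU.cu (reduceKernel, BLOCK_N, and THREAD_N are the sample's own names):

```cuda
//Copy input data to each GPU and launch work asynchronously on its stream
for (i = 0; i < GPU_N; i++)
{
    checkCudaErrors(cudaSetDevice(i));
    //Copy input data from CPU (page-locked, so the copy really is asynchronous)
    checkCudaErrors(cudaMemcpyAsync(plan[i].d_Data, plan[i].h_Data,
                                    plan[i].dataN * sizeof(float),
                                    cudaMemcpyHostToDevice, plan[i].stream));
    //Perform GPU computations on this device's stream
    reduceKernel<<<BLOCK_N, THREAD_N, 0, plan[i].stream>>>(
        plan[i].d_Sum, plan[i].d_Data, plan[i].dataN);
    //Read back GPU results
    checkCudaErrors(cudaMemcpyAsync(plan[i].h_Sum_from_device, plan[i].d_Sum,
                                    ACCUM_N * sizeof(float),
                                    cudaMemcpyDeviceToHost, plan[i].stream));
}

//Only now wait: every GPU already has its work queued, so they run concurrently
for (i = 0; i < GPU_N; i++)
{
    checkCudaErrors(cudaSetDevice(i));
    checkCudaErrors(cudaStreamSynchronize(plan[i].stream));
}
```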
And here's a guide on using multiple GPUs.
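Applying the same idea to the wrapper from the question: do the cudaMalloc calls (which block) before the per-GPU loop, pin the host buffers with cudaMallocHost so that cudaMemcpyAsync does not fall back to a blocking copy, and pass each device its own stream. A hypothetical sketch (the kernel name, sizes, and launch configuration are placeholders, not the questioner's actual code):

```cuda
void wrapper_kernel(int ngpu, float *d_data, float *h_data,
                    size_t n, cudaStream_t stream)
{
    cudaSetDevice(ngpu);
    //No cudaMalloc here: allocation was done up front, because it synchronizes
    cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    kernels<<<grid, block, 0, stream>>>(d_data, n);   //placeholder launch
    cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    //Returns immediately; wait later with cudaStreamSynchronize(stream)
}
```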