CUDA concurrent execution issue
Problem description

I would like to create a basic CUDA application that demonstrates memory transfer/kernel execution overlap for students. But in nvvp (the NVIDIA Visual Profiler), there appears to be no concurrent execution. Can you tell me what is wrong?
The full source (Visual Studio 2015, CUDA 8.0, sm_35/compute_35, Titan X card):
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <malloc.h>
#include <stdio.h>

#define MEMSIZE 8000000
#define STREAM_N 8

__global__ void TestKernel(char *img)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < 100; k++)
        img[pos] = img[pos] / 2 + 128;
}

int main()
{
    // allocate device memory, pinned host memory, and streams
    char *img[STREAM_N];
    char *d_img[STREAM_N];
    cudaStream_t streams[STREAM_N];
    for (int pi = 0; pi < STREAM_N; pi++)
    {
        cudaMalloc((void**)&d_img[pi], MEMSIZE / STREAM_N);
        cudaMallocHost((void**)&img[pi], MEMSIZE / STREAM_N);
        cudaStreamCreate(&streams[pi]);
    }

    // process packages one way (breadth-first):
    // all H2D copies, then all kernels, then all D2H copies
    for (int pi = 0; pi < STREAM_N; pi++)
        cudaMemcpyAsync(d_img[pi], img[pi], MEMSIZE / STREAM_N, cudaMemcpyHostToDevice, streams[pi]);
    for (int pi = 0; pi < STREAM_N; pi++)
        TestKernel<<<MEMSIZE / STREAM_N / 400, 400, 0, streams[pi]>>>(d_img[pi]);
    for (int pi = 0; pi < STREAM_N; pi++)
        cudaMemcpyAsync(img[pi], d_img[pi], MEMSIZE / STREAM_N, cudaMemcpyDeviceToHost, streams[pi]);

    // process packages another way (depth-first):
    // copy in, kernel, copy out per stream
    for (int pi = 0; pi < STREAM_N; pi++)
    {
        cudaMemcpyAsync(d_img[pi], img[pi], MEMSIZE / STREAM_N, cudaMemcpyHostToDevice, streams[pi]);
        TestKernel<<<MEMSIZE / STREAM_N / 400, 400, 0, streams[pi]>>>(d_img[pi]);
        cudaMemcpyAsync(img[pi], d_img[pi], MEMSIZE / STREAM_N, cudaMemcpyDeviceToHost, streams[pi]);
    }
    cudaDeviceSynchronize();

    // destroy streams and free memory
    for (int pi = 0; pi < STREAM_N; pi++)
    {
        cudaStreamDestroy(streams[pi]);
        cudaFreeHost(img[pi]);
        cudaFree(d_img[pi]);
    }
}
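As an aside, the listing above never checks return codes, so a failed allocation or launch would go unnoticed. A minimal error-checking wrapper (a common CUDA idiom, not part of the original listing) makes such failures visible during profiling experiments:

```cuda
#include <cstdio>
#include <cstdlib>
#include "cuda_runtime.h"

// Hypothetical helper: wrap a CUDA runtime call so that any failure
// aborts with the file/line and the runtime's error string.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Example use in the code above:
//   CUDA_CHECK(cudaMalloc((void**)&d_img[pi], MEMSIZE / STREAM_N));
//   CUDA_CHECK(cudaStreamCreate(&streams[pi]));
```

Kernel launches themselves return no value; after a launch you can call `CUDA_CHECK(cudaGetLastError())` to catch launch-configuration errors.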
Visual Profiler output: (timeline screenshot not reproduced)
Recommended answer
WDDM command batching caused the problem. The best solution is to switch the card's operating mode from WDDM to TCC, which can be done with the nvidia-smi command:
nvidia-smi -i <gpu_id> -dm 1
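Before switching, you can confirm which driver model the card is currently using (a sketch assuming a Windows machine; the TCC switch requires administrator rights, takes effect after a reboot, and is not available on a GPU that is driving a display):

```shell
# Query the current and pending driver model for each GPU (Windows only)
nvidia-smi --query-gpu=index,name,driver_model.current,driver_model.pending --format=csv
```

A card already in TCC mode will report "TCC" in the driver_model.current column.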
This solved my problem. The pattern I wanted to see: (timeline screenshot not reproduced)
An alternative solution is to manually flush the command queue using cudaStreamQuery (source), like:
for (int pi = 0; pi < STREAM_N; pi++)
{
cudaMemcpyAsync(d_img[pi], img[pi], MEMSIZE / STREAM_N, cudaMemcpyHostToDevice, streams[pi]);
TestKernel <<< MEMSIZE / STREAM_N / 400, 400, 0, streams[pi] >>>(d_img[pi]);
cudaMemcpyAsync(img[pi], d_img[pi], MEMSIZE / STREAM_N, cudaMemcpyDeviceToHost, streams[pi]);
cudaStreamQuery(streams[pi]); // FLUSH COMMAND QUEUE
}