CUDA concurrent execution issue


Problem description

I would like to create a basic CUDA application that demonstrates memory transfer/kernel execution overlap for students. But looking at the nvvp timeline, there seems to be no concurrent execution. Can you help me figure out what is wrong?

The full source (Visual Studio 2015, CUDA 8.0, sm3.5, arch3.5, Titan X card):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <malloc.h>
#include <stdio.h>

#define MEMSIZE 8000000
#define STREAM_N 8

// Dummy workload: repeatedly rescales each byte so the kernel runs
// long enough for copy/compute overlap to show up in the profiler
__global__ void TestKernel(char *img)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < 100; k++)
        img[pos] = img[pos] / 2 + 128;
}

int main()
{
    // allocate memory and streams
    char *img[STREAM_N];
    char *d_img[STREAM_N];
    cudaStream_t streams[STREAM_N];

    for (int pi = 0; pi < STREAM_N; pi++)
    {
        cudaMalloc((void**)&d_img[pi], MEMSIZE / STREAM_N);
        cudaMallocHost((void**)&img[pi], MEMSIZE / STREAM_N);
        cudaStreamCreate(&streams[pi]);
    }

    // process packages one way: issue all H2D copies, then all kernels,
    // then all D2H copies (breadth-first across streams)
    for (int pi = 0; pi < STREAM_N; pi++)
        cudaMemcpyAsync(d_img[pi], img[pi], MEMSIZE / STREAM_N, cudaMemcpyHostToDevice, streams[pi]);
    for (int pi = 0; pi < STREAM_N; pi++)
        TestKernel <<< MEMSIZE / STREAM_N / 400, 400, 0, streams[pi] >>>(d_img[pi]);
    for (int pi = 0; pi < STREAM_N; pi++)
        cudaMemcpyAsync(img[pi], d_img[pi], MEMSIZE / STREAM_N, cudaMemcpyDeviceToHost, streams[pi]);

    // process packages another way: copy in, run kernel, copy out,
    // one stream at a time (depth-first per stream)
    for (int pi = 0; pi < STREAM_N; pi++) 
    {
        cudaMemcpyAsync(d_img[pi], img[pi], MEMSIZE / STREAM_N, cudaMemcpyHostToDevice, streams[pi]);
        TestKernel <<< MEMSIZE / STREAM_N / 400, 400, 0, streams[pi] >>>(d_img[pi]);
        cudaMemcpyAsync(img[pi], d_img[pi], MEMSIZE / STREAM_N, cudaMemcpyDeviceToHost, streams[pi]);
    }
    cudaDeviceSynchronize();

    // destroy streams and free memory
    for (int pi = 0; pi < STREAM_N; pi++)
    {
        cudaStreamDestroy(streams[pi]);
        cudaFreeHost(img[pi]);
        cudaFree(d_img[pi]);
    }
}
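
As a sanity check before profiling, it can be worth confirming that the device supports copy/compute overlap at all; a minimal sketch (device index 0 assumed, error checking omitted):

#include "cuda_runtime.h"
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // device 0 assumed

    // asyncEngineCount > 0: copies can overlap kernel execution
    // (2 also lets H2D and D2H copies overlap each other)
    printf("asyncEngineCount: %d\n", prop.asyncEngineCount);
    // concurrentKernels = 1: kernels from different streams may run concurrently
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    return 0;
}

On a Titan X both values should be nonzero, so missing overlap points at the software stack rather than the hardware.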

Visual Profiler output:

Recommended answer

WDDM command batching caused the problem. The best solution is to switch the operating mode of the card from WDDM to TCC. This can be done via the nvidia-smi command:

nvidia-smi -i <gpu_id> -dm 1
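
Note that switching the driver model requires administrator privileges and a reboot, and a card in TCC mode can no longer drive a display. After rebooting, the new mode can be verified with a device query (on Windows the output should include a Driver Model section):

nvidia-smi -i <gpu_id> -q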

This solved my problem. The pattern I would like to see: timeline

An alternative solution is to manually flush the command queue using cudaStreamQuery (source), like:

for (int pi = 0; pi < STREAM_N; pi++)
{
    cudaMemcpyAsync(d_img[pi], img[pi], MEMSIZE / STREAM_N, cudaMemcpyHostToDevice, streams[pi]);
    TestKernel <<< MEMSIZE / STREAM_N / 400, 400, 0, streams[pi] >>>(d_img[pi]);
    cudaMemcpyAsync(img[pi], d_img[pi], MEMSIZE / STREAM_N, cudaMemcpyDeviceToHost, streams[pi]);
    cudaStreamQuery(streams[pi]); // FLUSH COMMAND QUEUE
}
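
cudaStreamQuery is non-blocking here: it immediately returns cudaSuccess or cudaErrorNotReady rather than waiting for the stream to finish, but as a side effect on WDDM it forces the driver to submit its batched command buffer to the GPU, so each stream's work is launched right away instead of being held back until the batch fills up.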
