CUDA kernels not launching before cudaDeviceSynchronize

Problem description

I am having some trouble with concurrent CUDA. Take a look at the attached image. The kernel is launched at the marked point, at 0.395 seconds. Then there is some green CpuWork. Finally, there is a call to cudaDeviceSynchronize. The kernel that is launched before CpuWork does not start until the synchronize call. Ideally, it should run in parallel with the CPU work.

void KdTreeGpu::traceRaysOnGpuAsync(int firstRayIndex, int numRays, int rank, int buffer)
{
    int per_block = 128;
    int num_blocks = numRays/per_block + (numRays%per_block==0?0:1);

    Ray* rays = &this->deviceRayPtr[firstRayIndex];
    int* outputHitPanelIds = &this->deviceHitPanelIdPtr[firstRayIndex];

    kdTreeTraversal<<<num_blocks, per_block, 0>>>(sceneBoundingBox, rays, deviceNodesPtr, deviceTrianglesListPtr, 
                                                firstRayIndex, numRays, rank, rootNodeIndex, 
                                                deviceTHitPtr, outputHitPanelIds, deviceReflectionPtr);

    CUDA_VALIDATE(cudaMemcpyAsync(resultHitDistances[buffer], deviceTHitPtr, numRays*sizeof(double), cudaMemcpyDeviceToHost));
    CUDA_VALIDATE(cudaMemcpyAsync(resultHitPanelIds[buffer], outputHitPanelIds, numRays*sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_VALIDATE(cudaMemcpyAsync(resultReflections[buffer], deviceReflectionPtr, numRays*sizeof(Vector3), cudaMemcpyDeviceToHost));
}

The memcopies are asynchronous. The result buffers are allocated like this:

unsigned int flag = cudaHostAllocPortable;

CUDA_VALIDATE(cudaHostAlloc(&resultHitPanelIds[0], MAX_RAYS_PER_ITERATION*sizeof(int), flag));
CUDA_VALIDATE(cudaHostAlloc(&resultHitPanelIds[1], MAX_RAYS_PER_ITERATION*sizeof(int), flag));

I am hoping for a solution to this. I have tried many things, including not running in the default stream. When I added cudaHostAlloc I noticed that the async calls returned control to the CPU, but that does not help when the kernel still does not launch before the later cudaDeviceSynchronize call.
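(For reference, the non-default-stream variant would look roughly like the sketch below. The cudaStream_t member traceStream is an assumed name, not part of the original code; the kernel and copies are issued into that stream and later waited on with cudaStreamSynchronize instead of cudaDeviceSynchronize.)

// Sketch only: issue the kernel and copies into a dedicated stream instead of stream 0.
// traceStream is an assumed cudaStream_t member, created once with cudaStreamCreate.
kdTreeTraversal<<<num_blocks, per_block, 0, traceStream>>>(sceneBoundingBox, rays, deviceNodesPtr, deviceTrianglesListPtr,
                                                           firstRayIndex, numRays, rank, rootNodeIndex,
                                                           deviceTHitPtr, outputHitPanelIds, deviceReflectionPtr);

CUDA_VALIDATE(cudaMemcpyAsync(resultHitDistances[buffer], deviceTHitPtr,
                              numRays*sizeof(double), cudaMemcpyDeviceToHost, traceStream));
// ... likewise for resultHitPanelIds and resultReflections, and later
// cudaStreamSynchronize(traceStream) instead of cudaDeviceSynchronize().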

resultHitDistances[2] contains two allocated memory areas, so that while buffer 0 is being read by the CPU, the GPU can put its results into buffer 1.
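In other words, the two buffers are used in a ping-pong fashion; the indices below mirror the ones used in the calling loop further down:

int buffer     = iteration % 2;        // buffer the CPU reads this iteration
int nextBuffer = (iteration + 1) % 2;  // buffer the GPU fills for the next iteration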

Thanks!

This is the code that calls traceRaysOnGpuAsync:

int numIterations = ceil(float(this->numPrimaryRays) / MAX_RAYS_PER_ITERATION);
int numRaysPrevious = min(MAX_RAYS_PER_ITERATION, this->numPrimaryRays);
nvtxRangePushA("traceRaysOnGpuAsync First");
traceRaysOnGpuAsync(0, numRaysPrevious, rank, 0);
nvtxRangePop();

for(int iteration = 0; iteration < numIterations; iteration++)
{

    int rayFrom = (iteration+1)*MAX_RAYS_PER_ITERATION;
    int rayTo = min((iteration+2)*MAX_RAYS_PER_ITERATION, this->numPrimaryRays) - 1;
    int numRaysIteration = rayTo-rayFrom+1;

    // Wait for results to finish and get them

    waitForGpu();
    // Trace the next iteration asynchronously. This will have data prepared for next iteration

    if(numRaysIteration > 0)
    {
        int nextBuffer = (iteration+1) % 2;
        nvtxRangePushA("traceRaysOnGpuAsync Interior");
        traceRaysOnGpuAsync(rayFrom, numRaysIteration, rank, nextBuffer);
        nvtxRangePop();
    }
    nvtxRangePushA("CpuWork");

    // Store results for current iteration

    int rayOffset = iteration*MAX_RAYS_PER_ITERATION;
    int buffer = iteration % 2;

    for(int i = 0; i < numRaysPrevious; i++)
    {
        if(this->activeRays[rayOffset+i] && resultHitPanelIds[buffer][i] >= 0)
        {
            this->activeRays[rayOffset+i] = false;
            const TrianglePanelPair & t = this->getTriangle(resultHitPanelIds[buffer][i]);
            double hitT = resultHitDistances[buffer][i];

            Vector3 reflectedDirection = resultReflections[buffer][i];

            Result res = Result(rays[rayOffset+i], hitT, t.panel);
            results[rank].push_back(res);
            t.panel->incrementIntensity(1.0);

            if (t.panel->getParent().absorbtion < 1)
            {
                numberOfRaysGenerated++;

                Ray reflected (res.endPoint() + 0.00001*reflectedDirection, reflectedDirection);

                this->newRays[rayOffset+i] = reflected;
                this->activeRays[rayOffset+i] = true;
                numNewRays++;

            }
        }
    }

    numRaysPrevious = numRaysIteration;

    nvtxRangePop();

}

Answer

This is the expected behavior on Windows with the WDDM driver model, where the driver tries to mitigate kernel launch overhead by batching kernel launches. Try inserting cudaStreamQuery(0) straight after the kernel invocation to trigger early launching of the kernel before the batch is full.
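Applied to the function from the question, that would look roughly like the sketch below; cudaStreamQuery(0) queries stream 0 (the default stream), and on WDDM it has the useful side effect of flushing the driver's command batch so the queued kernel is submitted to the GPU immediately.

kdTreeTraversal<<<num_blocks, per_block, 0>>>(sceneBoundingBox, rays, deviceNodesPtr, deviceTrianglesListPtr,
                                              firstRayIndex, numRays, rank, rootNodeIndex,
                                              deviceTHitPtr, outputHitPanelIds, deviceReflectionPtr);

// On WDDM, querying the stream forces the batched launch to be submitted now.
// The return value (often cudaErrorNotReady while work is pending) is intentionally ignored,
// so it is not wrapped in CUDA_VALIDATE.
cudaStreamQuery(0);

CUDA_VALIDATE(cudaMemcpyAsync(resultHitDistances[buffer], deviceTHitPtr, numRays*sizeof(double), cudaMemcpyDeviceToHost));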
