CUDA kernels not launching before cudaDeviceSynchronize

Problem description

I am having some trouble with concurrent CUDA. Take a look at the attached image. The kernel is launched at the marked point, at 0.395 seconds. Then there is some green CpuWork. Finally, there is a call to cudaDeviceSynchronize. The kernel that is launched before CpuWork does not start until the synchronize call. Ideally, it should run in parallel with the CPU work.

void KdTreeGpu::traceRaysOnGpuAsync(int firstRayIndex, int numRays, int rank, int buffer)
{
    int per_block = 128;
    int num_blocks = numRays/per_block + (numRays%per_block==0?0:1);

    Ray* rays = &this->deviceRayPtr[firstRayIndex];
    int* outputHitPanelIds = &this->deviceHitPanelIdPtr[firstRayIndex];

    kdTreeTraversal<<<num_blocks, per_block, 0>>>(sceneBoundingBox, rays, deviceNodesPtr, deviceTrianglesListPtr, 
                                                firstRayIndex, numRays, rank, rootNodeIndex, 
                                                deviceTHitPtr, outputHitPanelIds, deviceReflectionPtr);

    CUDA_VALIDATE(cudaMemcpyAsync(resultHitDistances[buffer], deviceTHitPtr, numRays*sizeof(double), cudaMemcpyDeviceToHost));
    CUDA_VALIDATE(cudaMemcpyAsync(resultHitPanelIds[buffer], outputHitPanelIds, numRays*sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_VALIDATE(cudaMemcpyAsync(resultReflections[buffer], deviceReflectionPtr, numRays*sizeof(Vector3), cudaMemcpyDeviceToHost));
}

The memcopies are asynchronous. The result buffers are allocated like this:

unsigned int flag = cudaHostAllocPortable;

CUDA_VALIDATE(cudaHostAlloc(&resultHitPanelIds[0], MAX_RAYS_PER_ITERATION*sizeof(int), flag));
CUDA_VALIDATE(cudaHostAlloc(&resultHitPanelIds[1], MAX_RAYS_PER_ITERATION*sizeof(int), flag));

I am hoping for a solution to this. I have tried many things, including not running in the default stream. When I added cudaHostAlloc I noticed that the async calls returned control to the CPU, but that does not help when the kernel still does not launch before the later cudaDeviceSynchronize call.
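(For reference, the non-default-stream variant would look roughly like the sketch below. The cudaStream_t member traceStream is an assumed name, not part of the original code; the kernel and copies are issued into that stream and later waited on with cudaStreamSynchronize instead of cudaDeviceSynchronize.)

// Sketch only: issue the kernel and copies into a dedicated stream instead of stream 0.
// traceStream is an assumed cudaStream_t member, created once with cudaStreamCreate.
kdTreeTraversal<<<num_blocks, per_block, 0, traceStream>>>(sceneBoundingBox, rays, deviceNodesPtr, deviceTrianglesListPtr,
                                                           firstRayIndex, numRays, rank, rootNodeIndex,
                                                           deviceTHitPtr, outputHitPanelIds, deviceReflectionPtr);

CUDA_VALIDATE(cudaMemcpyAsync(resultHitDistances[buffer], deviceTHitPtr,
                              numRays*sizeof(double), cudaMemcpyDeviceToHost, traceStream));
// ... likewise for resultHitPanelIds and resultReflections, and later
// cudaStreamSynchronize(traceStream) instead of cudaDeviceSynchronize().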

resultHitDistances[2] contains two allocated memory areas, so that while buffer 0 is being read by the CPU, the GPU can put its results into buffer 1.
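In other words, the two buffers are used in a ping-pong fashion; the indices below mirror the ones used in the calling loop further down:

int buffer     = iteration % 2;        // buffer the CPU reads this iteration
int nextBuffer = (iteration + 1) % 2;  // buffer the GPU fills for the next iteration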

Thanks!

This is the code that calls traceRaysOnGpuAsync:

int numIterations = ceil(float(this->numPrimaryRays) / MAX_RAYS_PER_ITERATION);
int numRaysPrevious = min(MAX_RAYS_PER_ITERATION, this->numPrimaryRays);
nvtxRangePushA("traceRaysOnGpuAsync First");
traceRaysOnGpuAsync(0, numRaysPrevious, rank, 0);
nvtxRangePop();

for(int iteration = 0; iteration < numIterations; iteration++)
{

    int rayFrom = (iteration+1)*MAX_RAYS_PER_ITERATION;
    int rayTo = min((iteration+2)*MAX_RAYS_PER_ITERATION, this->numPrimaryRays) - 1;
    int numRaysIteration = rayTo-rayFrom+1;

    // Wait for results to finish and get them

    waitForGpu();
    // Trace the next iteration asynchronously. This will have data prepared for next iteration

    if(numRaysIteration > 0)
    {
        int nextBuffer = (iteration+1) % 2;
        nvtxRangePushA("traceRaysOnGpuAsync Interior");
        traceRaysOnGpuAsync(rayFrom, numRaysIteration, rank, nextBuffer);
        nvtxRangePop();
    }
    nvtxRangePushA("CpuWork");

    // Store results for current iteration

    int rayOffset = iteration*MAX_RAYS_PER_ITERATION;
    int buffer = iteration % 2;

    for(int i = 0; i < numRaysPrevious; i++)
    {
        if(this->activeRays[rayOffset+i] && resultHitPanelIds[buffer][i] >= 0)
        {
            this->activeRays[rayOffset+i] = false;
            const TrianglePanelPair & t = this->getTriangle(resultHitPanelIds[buffer][i]);
            double hitT = resultHitDistances[buffer][i];

            Vector3 reflectedDirection = resultReflections[buffer][i];

            Result res = Result(rays[rayOffset+i], hitT, t.panel);
            results[rank].push_back(res);
            t.panel->incrementIntensity(1.0);

            if (t.panel->getParent().absorbtion < 1)
            {
                numberOfRaysGenerated++;

                Ray reflected (res.endPoint() + 0.00001*reflectedDirection, reflectedDirection);

                this->newRays[rayOffset+i] = reflected;
                this->activeRays[rayOffset+i] = true;
                numNewRays++;

            }
        }
    }

    numRaysPrevious = numRaysIteration;

    nvtxRangePop();

}

Answer

This is the expected behavior on Windows with the WDDM driver model, where the driver tries to mitigate kernel launch overhead by batching kernel launches. Try inserting cudaStreamQuery(0) straight after the kernel invocation to trigger early launching of the kernel before the batch is full.
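Applied to the function from the question, that would look roughly like the sketch below; cudaStreamQuery(0) queries stream 0 (the default stream), and on WDDM it has the useful side effect of flushing the driver's command batch so the queued kernel is submitted to the GPU immediately.

kdTreeTraversal<<<num_blocks, per_block, 0>>>(sceneBoundingBox, rays, deviceNodesPtr, deviceTrianglesListPtr,
                                              firstRayIndex, numRays, rank, rootNodeIndex,
                                              deviceTHitPtr, outputHitPanelIds, deviceReflectionPtr);

// On WDDM, querying the stream forces the batched launch to be submitted now.
// The return value (often cudaErrorNotReady while work is pending) is intentionally ignored,
// so it is not wrapped in CUDA_VALIDATE.
cudaStreamQuery(0);

CUDA_VALIDATE(cudaMemcpyAsync(resultHitDistances[buffer], deviceTHitPtr, numRays*sizeof(double), cudaMemcpyDeviceToHost));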
