CUDA limit seems to be reached, but what limit is that?

Question

I have a CUDA program that seems to be hitting some sort of limit on some resource, but I can't figure out what that resource is. Here is the kernel function:

__global__ void DoCheck(float2* points, int* segmentToPolylineIndexMap, 
                        int segmentCount, int* output)
{
    int segmentIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int pointCount = segmentCount + 1;

    if(segmentIndex >= segmentCount)
        return;

    int polylineIndex = segmentToPolylineIndexMap[segmentIndex];
    int result = 0;
    if(polylineIndex >= 0)
    {
        float2 p1 = points[segmentIndex];
        float2 p2 = points[segmentIndex+1];
        float2 A = p2;
        float2 a;
        a.x = p2.x - p1.x;
        a.y = p2.y - p1.y;

        for(int i = segmentIndex+2; i < segmentCount; i++)
        {
            int currentPolylineIndex = segmentToPolylineIndexMap[i];

            // legit if this is not another segment of our own polyline
            // and not a fake segment
            bool isLegit = (currentPolylineIndex != polylineIndex && 
                currentPolylineIndex >= 0);      

            float2 p3 = points[i];
            float2 p4 = points[i+1];
            float2 B = p4;
            float2 b;
            b.x = p4.x - p3.x;
            b.y = p4.y - p3.y;

            float2 c;
            c.x = B.x - A.x;
            c.y = B.y - A.y;

            float2 b_perp;
            b_perp.x = -b.y;
            b_perp.y = b.x;

            float numerator = dot(b_perp, c);
            float denominator = dot(b_perp, a);
            bool isParallel = (denominator == 0.0);

            float quotient = numerator / denominator;
            float2 intersectionPoint;
            intersectionPoint.x = quotient * a.x + A.x;
            intersectionPoint.y = quotient * a.y + A.y;

            result = result | (isLegit && !isParallel && 
                intersectionPoint.x > min(p1.x, p2.x) && 
                intersectionPoint.x > min(p3.x, p4.x) && 
                intersectionPoint.x < max(p1.x, p2.x) && 
                intersectionPoint.x < max(p3.x, p4.x) && 
                intersectionPoint.y > min(p1.y, p2.y) && 
                intersectionPoint.y > min(p3.y, p4.y) && 
                intersectionPoint.y < max(p1.y, p2.y) && 
                intersectionPoint.y < max(p3.y, p4.y));
        }
    }

    output[segmentIndex] = result;
}

Here is the call to execute the kernel function:

DoCheck<<<702, 32>>>(
    (float2*)devicePoints, 
    deviceSegmentsToPolylineIndexMap, 
    numSegments, 
    deviceOutput);

The sizes of the parameters are as follows:

  • devicePoints = 22,464 float2s = 179,712 bytes
  • deviceSegmentsToPolylineIndexMap = 22,463 ints = 89,852 bytes
  • numSegments = 1 int = 4 bytes
  • deviceOutput = 22,463 ints = 89,852 bytes

When I execute this kernel, it crashes the video card. It appears that I am hitting some sort of limit, because if I execute the kernel using DoCheck<<<300, 32>>>(...); it works. To be clear, the parameters are the same; only the number of blocks differs.

Any idea why one crashes the video driver and the other doesn't? The one that fails still seems to be within the card's limit on the number of blocks.

Update: more information on my system configuration:

  • Video card: nVidia 8800GT
  • CUDA version: 1.1
  • OS: Windows Server 2008 R2

I also tried it on a laptop with the following configuration and got the same results:

  • Video card: nVidia Quadro FX 880M
  • CUDA version: 1.2
  • OS: Windows 7 64-bit

Recommended answer

The resource being exhausted is time. On all current CUDA platforms, the display driver includes a watchdog timer that kills any kernel that takes more than a few seconds to execute. Code running on a card that is also driving a display is subject to this limit.
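One way to confirm that the watchdog is what kills the kernel is to ask the runtime whether a run-time limit applies to the device, and to check the error code after the launch. The following is a minimal sketch, not part of the original answer; it assumes a CUDA toolkit recent enough to provide cudaDeviceGetAttribute and cudaDeviceSynchronize, and the helper names watchdogEnabled and finishedWithoutTimeout are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if the runtime reports a run-time limit
// (watchdog) on kernels for the given device. Returns false if the query
// itself fails, for example because the device index is invalid.
bool watchdogEnabled(int device)
{
    int timeoutEnabled = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &timeoutEnabled, cudaDevAttrKernelExecTimeout, device);
    if (err != cudaSuccess)
        return false;
    return timeoutEnabled != 0;
}

// Hypothetical helper: waits for previously launched work and reports whether
// it completed. A kernel killed by the watchdog surfaces as
// cudaErrorLaunchTimeout; any other failure is printed as well.
bool finishedWithoutTimeout()
{
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaErrorLaunchTimeout) {
        std::printf("kernel was killed by the display watchdog\n");
        return false;
    }
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

Calling watchdogEnabled(0) before the launch and finishedWithoutTimeout() right after DoCheck<<<702, 32>>>(...) would show whether the failing launch ends with a timeout error rather than some other fault. On a driver that resets the device, later CUDA calls may also fail until the context is recreated.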

On the WDDM Windows platforms you are using, there are three possible solutions or workarounds:

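  1. Get a Tesla card and use the TCC driver, which eliminates the problem completely.
  2. Try modifying registry settings to increase the timer limit (search for the TdrDelay registry key for more information; I am not a Windows user and can't be more specific than that).
  3. Modify your kernel code to be "re-entrant" and process the data-parallel workload in several kernel launches rather than one. Kernel launch overhead isn't all that large, and spreading the workload over several kernel runs is often pretty easy to achieve, depending on the algorithm you are using.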
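The third option can be sketched as follows. This is only an illustration under stated assumptions, not the answerer's code: DoCheckChunk is a hypothetical stand-in for DoCheck that takes an extra firstSegment offset, the pairwise intersection test is replaced by a placeholder, and launchCount and runInChunks are hypothetical host helpers. The chunk size of 300 blocks used in the note below is taken from the question, where a 300-block launch completed.

```cuda
#include <cuda_runtime.h>

// Stand-in for DoCheck: each thread handles one segment, offset by
// firstSegment so that successive launches cover successive slices.
// The real per-pair intersection test is elided; the loop body is a
// placeholder that only counts the later segments.
__global__ void DoCheckChunk(int firstSegment, int segmentCount, int* output)
{
    int segmentIndex = firstSegment + threadIdx.x + blockIdx.x * blockDim.x;
    if (segmentIndex >= segmentCount)
        return;

    int result = 0;
    for (int i = segmentIndex + 2; i < segmentCount; i++)
        result += 1;  // placeholder for the per-pair work

    output[segmentIndex] = result;
}

// Number of launches needed to cover totalBlocks with at most
// blocksPerLaunch blocks in each launch (ceiling division).
int launchCount(int totalBlocks, int blocksPerLaunch)
{
    return (totalBlocks + blocksPerLaunch - 1) / blocksPerLaunch;
}

// Runs the whole workload as several smaller launches. Synchronizing after
// each launch means every launch is submitted and completes on its own,
// instead of one long-running kernel covering all segments.
cudaError_t runInChunks(int segmentCount, int* deviceOutput,
                        int threadsPerBlock, int blocksPerLaunch)
{
    int totalBlocks = (segmentCount + threadsPerBlock - 1) / threadsPerBlock;
    int launches = launchCount(totalBlocks, blocksPerLaunch);

    for (int launch = 0; launch < launches; launch++) {
        int firstBlock = launch * blocksPerLaunch;
        int remaining = totalBlocks - firstBlock;
        int blocks = remaining < blocksPerLaunch ? remaining : blocksPerLaunch;

        DoCheckChunk<<<blocks, threadsPerBlock>>>(
            firstBlock * threadsPerBlock, segmentCount, deviceOutput);

        cudaError_t err = cudaGetLastError();   // launch configuration errors
        if (err != cudaSuccess)
            return err;
        err = cudaDeviceSynchronize();          // execution errors, timeouts
        if (err != cudaSuccess)
            return err;
    }
    return cudaSuccess;
}
```

With the sizes from the question, runInChunks(numSegments, deviceOutput, 32, 300) would issue three launches of 300, 300, and 102 blocks. Because each thread's inner loop runs from segmentIndex + 2 to segmentCount, the lowest segment indices have the longest loops, so later chunks of the same size should do no more work than the first one. If a single chunk still runs too long, a smaller blocksPerLaunch is the obvious knob to try.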
