CUDA limit seems to be reached, but what limit is that?

Problem Description

I have a CUDA program that seems to be hitting some sort of limit of some resource, but I can't figure out what that resource is. Here is the kernel function:

__global__ void DoCheck(float2* points, int* segmentToPolylineIndexMap, 
                        int segmentCount, int* output)
{
    int segmentIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int pointCount = segmentCount + 1;

    if(segmentIndex >= segmentCount)
        return;

    int polylineIndex = segmentToPolylineIndexMap[segmentIndex];
    int result = 0;
    if(polylineIndex >= 0)
    {
        float2 p1 = points[segmentIndex];
        float2 p2 = points[segmentIndex+1];
        float2 A = p2;
        float2 a;
        a.x = p2.x - p1.x;
        a.y = p2.y - p1.y;

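        // Note: each thread scans every later segment here, so the total
        // work across the grid grows roughly quadratically with segmentCount.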
        for(int i = segmentIndex+2; i < segmentCount; i++)
        {
            int currentPolylineIndex = segmentToPolylineIndexMap[i];

            // if not a different segment within our polyline and
            // not a fake segment
            bool isLegit = (currentPolylineIndex != polylineIndex && 
                currentPolylineIndex >= 0);      

            float2 p3 = points[i];
            float2 p4 = points[i+1];
            float2 B = p4;
            float2 b;
            b.x = p4.x - p3.x;
            b.y = p4.y - p3.y;

            float2 c;
            c.x = B.x - A.x;
            c.y = B.y - A.y;

            float2 b_perp;
            b_perp.x = -b.y;
            b_perp.y = b.x;

            float numerator = dot(b_perp, c);
            float denominator = dot(b_perp, a);
            bool isParallel = (denominator == 0.0);

            float quotient = numerator / denominator;
            float2 intersectionPoint;
            intersectionPoint.x = quotient * a.x + A.x;
            intersectionPoint.y = quotient * a.y + A.y;

            result = result | (isLegit && !isParallel && 
                intersectionPoint.x > min(p1.x, p2.x) && 
                intersectionPoint.x > min(p3.x, p4.x) && 
                intersectionPoint.x < max(p1.x, p2.x) && 
                intersectionPoint.x < max(p3.x, p4.x) && 
                intersectionPoint.y > min(p1.y, p2.y) && 
                intersectionPoint.y > min(p3.y, p4.y) && 
                intersectionPoint.y < max(p1.y, p2.y) && 
                intersectionPoint.y < max(p3.y, p4.y));
        }
    }

    output[segmentIndex] = result;
}

Here is the call to execute the kernel function:

DoCheck<<<702, 32>>>(
    (float2*)devicePoints, 
    deviceSegmentsToPolylineIndexMap, 
    numSegments, 
    deviceOutput);

The sizes of the parameters are as follows:

  • devicePoints = 22,464 float2s = 179,712 bytes
  • deviceSegmentsToPolylineIndexMap = 22,463 ints = 89,852 bytes
  • numSegments = 1 int = 4 bytes
  • deviceOutput = 22,463 ints = 89,852 bytes
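
For context, the 702 blocks in the launch above presumably come from the usual ceiling-division grid-size calculation over the 22,463 segments (the question doesn't state how 702 was chosen, so this is an assumption):

const int threadsPerBlock = 32;
// ceil(22,463 / 32) = 702
int blockCount = (numSegments + threadsPerBlock - 1) / threadsPerBlock;
DoCheck<<<blockCount, threadsPerBlock>>>(
    (float2*)devicePoints,
    deviceSegmentsToPolylineIndexMap,
    numSegments,
    deviceOutput);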

When I execute this kernel, it crashes the video card. It would appear that I am hitting some sort of limit, because if I execute the kernel using DoCheck<<<300, 32>>>(...);, it works. Just to be clear, the parameters are the same, just the number of blocks is different.

Any idea why one crashes the video driver and the other doesn't? The one that fails still seems to be within the card's limit on the number of blocks.

Update

More information on my system configuration:

  • Video Card: nVidia 8800GT
  • CUDA Version: 1.1
  • OS: Windows Server 2008 R2

I also tried it on a laptop with the following configuration, but got the same results:

  • Video Card: nVidia Quadro FX 880M
  • CUDA Version: 1.2
  • OS: Windows 7 64-bit

Solution

The resource which is being exhausted is time. On all current CUDA platforms, the display driver includes a watchdog timer which will kill any kernel which takes more than a few seconds to execute. Running code on a card which is running a display is subject to this limit.
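
One quick way to confirm this is what's happening (a sketch using standard CUDA runtime API calls and the pointers from the question, not part of the original program) is to check whether the device enforces a run-time limit and to inspect the error code returned after the kernel completes:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// non-zero means the display watchdog applies to kernels on this device
printf("kernelExecTimeoutEnabled = %d\n", prop.kernelExecTimeoutEnabled);

DoCheck<<<702, 32>>>((float2*)devicePoints, deviceSegmentsToPolylineIndexMap,
                     numSegments, deviceOutput);
cudaError_t err = cudaDeviceSynchronize();  // cudaThreadSynchronize() on very old toolkits
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));  // a timed-out launch is reported here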

On the WDDM Windows platforms you are using, there are three possible solutions/work-arounds:

  1. Get a Tesla card and use the TCC driver, which eliminates the problem completely
  2. Try modifying registry settings to increase the timer limit (google for the TdrDelay registry key for more information, but I am not a Windows user and can't be more specific than that)
  3. Modify your kernel code to be "re-entrant" and process the data parallel work load in several kernel launches rather than one. Kernel launch overhead isn't all that large and processing the workload over several kernel runs is often pretty easy to achieve, depending on the algorithm you are using.
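
As a rough sketch of option 3 (the DoCheckRange variant and the chunk size below are hypothetical, not from the original code), the same check can be issued over ranges of segments so that each individual launch finishes well inside the watchdog limit:

// Hypothetical variant of DoCheck: identical body, but each launch only
// covers segments [firstSegment, firstSegment + segmentsThisLaunch)
__global__ void DoCheckRange(float2* points, int* segmentToPolylineIndexMap,
                             int firstSegment, int segmentsThisLaunch,
                             int segmentCount, int* output)
{
    int segmentIndex = firstSegment + threadIdx.x + blockIdx.x * blockDim.x;
    if(segmentIndex >= firstSegment + segmentsThisLaunch ||
       segmentIndex >= segmentCount)
        return;
    // ... the rest of the original DoCheck body, unchanged ...
}

// Host side: several short launches instead of one long one
const int threadsPerBlock = 32;
const int segmentsPerLaunch = 300 * threadsPerBlock;  // a chunk size that already worked

for(int first = 0; first < numSegments; first += segmentsPerLaunch)
{
    int count = numSegments - first;
    if(count > segmentsPerLaunch)
        count = segmentsPerLaunch;
    int blocks = (count + threadsPerBlock - 1) / threadsPerBlock;

    DoCheckRange<<<blocks, threadsPerBlock>>>(
        (float2*)devicePoints,
        deviceSegmentsToPolylineIndexMap,
        first, count, numSegments,
        deviceOutput);

    cudaDeviceSynchronize();  // make sure each chunk finishes before queuing the next
}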
