Can't enter a __global__ function using CUDA


Question

I have written a program in Nsight that compiles and runs, but the first kernel launch never completes.

The strange thing is that when I run it in debug mode it works perfectly, but it is too slow.

Here is the part of the code that runs before the function that accesses the GPU (this is where I think there is an error I can't find):

void parallelAction (int * dataReturned, char * data, unsigned char * descBase, int range, int cardBase, int streamIdx)
{
    size_t inputBytes = range*128*sizeof(unsigned char);
    size_t baseBytes = cardBase*128*sizeof(unsigned char);
    size_t outputBytes = range*sizeof(int);

    unsigned char * data_d;
    unsigned char * descBase_d;
    int * cardBase_d;
    int * dataReturned_d;

    cudaMalloc((void **) &data_d, inputBytes);  
    cudaMalloc((void **) &descBase_d, baseBytes);
    cudaMalloc((void **) &cardBase_d, sizeof(int));
    cudaMalloc((void **) &dataReturned_d, outputBytes);

    int blockSize = 196;
    int nBlocks = range/blockSize + (range%blockSize == 0?0:1);

    cudaMemcpy(data_d, data, inputBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(descBase_d, descBase, baseBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(cardBase_d, &cardBase, sizeof(int), cudaMemcpyHostToDevice);

    FindClosestDescriptor<<< nBlocks, blockSize >>>(dataReturned_d, data_d, descBase_d, cardBase_d);

    cudaMemcpy(dataReturned, dataReturned_d, outputBytes, cudaMemcpyDeviceToHost);

    cudaFree(data_d);
    cudaFree(descBase_d);
    cudaFree(cardBase_d);
    cudaFree(dataReturned_d);
}

And the function that runs on the GPU (I don't think the error is here):

__global__ void FindClosestDescriptor(int * dataReturned, unsigned char * data, unsigned char * base, int *cardBase)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned char descriptor1[128], descriptor2[128];
    int part = 0;
    int result = 0;
    int winner = 0;
    int minDistance = 0;
    int itelimit = *cardBase;
    for (int k = 0; k < 128; k++)
    {
        descriptor1[k] = data[idx*128+k];

    }
    // initialize minDistance
    for (int k = 0; k < 128; k++)
    {
        descriptor2[k] = base[k];
    }

    for (int k = 0; k < 128; k++)
    {
        part = (descriptor1[k]-descriptor2[k]);
        part *= part;
        minDistance += part;
    }

    // test all descriptors in the base :
    for (int i = 1; i < itelimit; i++)
    {
        result = 0;
        for (int k = 0; k < 128; k++)
        {
            descriptor2[k] = base[i*128+k];
            // Calculate squared l2 distance :
            part = (descriptor1[k]-descriptor2[k]);
            part *= part;
            result += part;
        }

        // Compare to minDistance
        if (result < minDistance)
        {
            minDistance = result;
            winner = i;
        }
    }

    // Write the result in dataReturned
    dataReturned[idx] = winner;
}

Thanks in advance for your help.

EDIT: the last cudaMemcpy returns the error "the launch timed out and was terminated".

Recommended answer

Linux has a watchdog mechanism. If your kernel runs for a long time (you say it is slow in debug mode), you can hit the Linux watchdog and receive the "launch timed out and was terminated" error.
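Kernel launches are asynchronous, which is why the timeout is reported by the cudaMemcpy that follows the launch rather than by the launch itself. The following is a minimal sketch, not part of the original post, of how one might ask the runtime whether a run-time limit applies to kernels on a device and report errors at the point where they surface. The helper names `checkCuda` and `watchdogEnabled` are hypothetical, and device 0 in the usage comment is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a CUDA runtime error together with the step
// that reported it. Returns true when the call succeeded.
bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical helper: true if the device reports a run-time limit on
// kernels, which is usually the case when it also drives a display.
bool watchdogEnabled(int device)
{
    cudaDeviceProp prop;
    if (!checkCuda(cudaGetDeviceProperties(&prop, device),
                   "cudaGetDeviceProperties"))
        return false;
    return prop.kernelExecTimeoutEnabled != 0;
}

// Possible usage inside parallelAction (device 0 is an assumption):
//
//   if (watchdogEnabled(0))
//       fprintf(stderr, "kernels on device 0 are subject to a time limit\n");
//
//   FindClosestDescriptor<<< nBlocks, blockSize >>>(dataReturned_d, data_d,
//                                                   descBase_d, cardBase_d);
//   checkCuda(cudaGetLastError(), "kernel launch");         // bad launch configuration
//   checkCuda(cudaDeviceSynchronize(), "kernel execution"); // timeouts show up here
```

Synchronizing right after the launch is only a debugging aid; it makes the timeout appear at the kernel rather than at the later copy.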

In this case there are several things you might try. The options are covered here.
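The linked options are not reproduced here. One commonly used approach, which may or may not be among them, is to shorten each individual launch by processing the input in pieces. The sketch below assumes that approach. `FindClosestDescriptorChunk`, `launchInChunks`, `ceilDiv`, and the `offset`, `count`, and `chunk` parameters are hypothetical and do not appear in the question. The kernel takes `cardBase` by value and adds a bounds guard, since the rounded-up grid in the question starts threads whose index is past `range`.

```cuda
#include <climits>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n items.
__host__ __device__ inline int ceilDiv(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Variant of the kernel from the question that handles `count` descriptors
// starting at `offset`. Each thread finds the closest base descriptor
// (squared L2 distance over 128 bytes) for one input descriptor.
__global__ void FindClosestDescriptorChunk(int *dataReturned,
                                           const unsigned char *data,
                                           const unsigned char *base,
                                           int cardBase, int offset, int count)
{
    int local = blockDim.x * blockIdx.x + threadIdx.x;
    if (local >= count) return;          // threads past the end of the chunk do nothing
    int idx = offset + local;

    const unsigned char *d1 = data + (size_t)idx * 128;
    int winner = 0;
    int minDistance = INT_MAX;
    for (int i = 0; i < cardBase; i++) {
        const unsigned char *d2 = base + (size_t)i * 128;
        int result = 0;
        for (int k = 0; k < 128; k++) {
            int part = (int)d1[k] - (int)d2[k];
            result += part * part;
        }
        if (result < minDistance) {      // strict '<' keeps the first minimum
            minDistance = result;
            winner = i;
        }
    }
    dataReturned[idx] = winner;
}

// Launch the search over `range` descriptors in pieces of at most `chunk`
// descriptors, so that no single launch runs for long.
cudaError_t launchInChunks(int *dataReturned_d, const unsigned char *data_d,
                           const unsigned char *descBase_d, int cardBase,
                           int range, int chunk, int blockSize)
{
    if (chunk <= 0 || blockSize <= 0) return cudaErrorInvalidValue;
    for (int offset = 0; offset < range; offset += chunk) {
        int count = (range - offset < chunk) ? (range - offset) : chunk;
        FindClosestDescriptorChunk<<<ceilDiv(count, blockSize), blockSize>>>(
            dataReturned_d, data_d, descBase_d, cardBase, offset, count);
        // Wait for this chunk and stop early if it failed.
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

The chunk size has to be found by experiment: it should be small enough that each launch finishes well inside the time limit, and large enough that the GPU stays busy.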
