Illegal Memory Access on cudaDeviceSynchronize

Question

I am encountering a very strange bug: I get an 'illegal memory access' error when running a Heat 2D simulation of a particular size, but the exact same simulation runs fine with fewer elements.

Is there a reason that increasing the size of an array would cause this exception? I am using a Titan Black GPU (6 GB of memory), but the simulation I am running is nowhere near that size. I calculated that I should be able to run a 4000 x 4000 simulation, but I get errors if I exceed 250 x 250.

The error occurs immediately after I instantiate the array of simulation objects on the device. The instantiation code is as follows:

template<typename PlaceType, typename StateType>
__global__ void instantiatePlacesKernel(Place** places, StateType *state,
        void *arg, int *dims, int nDims, int qty) {
    unsigned idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx < qty) {
        // set pointer to corresponding state object
        places[idx] = new PlaceType(&(state[idx]), arg);
        places[idx]->setIndex(idx);
        places[idx]->setSize(dims, nDims);
    }
}

template<typename PlaceType, typename StateType>
Place** DeviceConfig::instantiatePlaces(int handle, void *argument, int argSize,
        int dimensions, int size[], int qty) {

    // add global constants to the GPU
    memcpy(glob.globalDims,size, sizeof(int) * dimensions);
    updateConstants(glob);

    // create places tracking
    PlaceArray p; // a struct to track qty, 
    p.qty = qty;

    // create state array on device
    StateType* d_state = NULL;
    int Sbytes = sizeof(StateType);
    CATCH(cudaMalloc((void** ) &d_state, qty * Sbytes));
    p.devState = d_state; // save device pointer

    // allocate device pointers
    Place** tmpPlaces = NULL;
    int ptrbytes = sizeof(Place*);
    CATCH(cudaMalloc((void** ) &tmpPlaces, qty * ptrbytes));
    p.devPtr = tmpPlaces; // save device pointer

    // handle arg if necessary
    void *d_arg = NULL;
    if (NULL != argument) {
        CATCH(cudaMalloc((void** ) &d_arg, argSize));
        CATCH(cudaMemcpy(d_arg, argument, argSize, H2D));
    }

    // load places dimensions
    int *d_dims;
    int dimBytes = sizeof(int) * dimensions;
    CATCH(cudaMalloc((void** ) &d_dims, dimBytes));
    CATCH(cudaMemcpy(d_dims, size, dimBytes, H2D));

    // launch instantiation kernel
    int blockDim = (qty - 1) / BLOCK_SIZE + 1;
    int threadDim = (qty - 1) / blockDim + 1;
    Logger::debug("Launching instantiation kernel");
    instantiatePlacesKernel<PlaceType, StateType> <<<blockDim, threadDim>>>(tmpPlaces, d_state,
            d_arg, d_dims, dimensions, qty);
    CHECK();

    CATCH(cudaDeviceSynchronize()); // ERROR OCCURS HERE

    // clean up memory
    if (NULL != argument) {
        CATCH(cudaFree(d_arg));
    }
    CATCH(cudaFree(d_dims));
    CATCH(cudaMemGetInfo(&freeMem, &allMem));

    return p.devPtr;
}

Please assume any custom types you see are working, as this code executes without error on a sufficiently small simulation. What frustrates me is that the number of elements in the kernel function's places and state arrays appears to cause the error once the size exceeds 250 x 250 elements. Any insight would be awesome.

Thanks!

Recommended answer

I think it's likely that in-kernel new is failing because you are allocating too much memory.

In-kernel new has similar behavior and limitations to in-kernel malloc. These allocations are limited to the device heap, which by default starts out at 8MB. If the 250x250 array size corresponds to something in that range (8MB), then going significantly above it would cause some of the new operations to fail "silently" (i.e. return null pointers). If you then try to use those null pointers, you'll get an illegal memory access.
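To make the silent failure visible, here is a minimal sketch that allocates one object per thread from the device heap and counts how many allocations came back null. The names Payload, allocateKernel, releaseKernel, countFailedAllocations, and CUDA_OK are hypothetical and not part of the original code, and the 256-byte object size is an assumption chosen only so that a large request exceeds the default heap.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: abort with a readable message if a runtime call fails.
#define CUDA_OK(call)                                                  \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                         cudaGetErrorString(err_));                    \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Hypothetical stand-in for a per-cell simulation object. It has no
// constructor, so a failed allocation is never written to.
struct Payload {
    char bytes[256];
};

// Each thread allocates one object on the device heap and records a failure
// if the returned pointer is null.
__global__ void allocateKernel(Payload **slots, int qty, unsigned int *failures) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < qty) {
        Payload *p = new Payload;
        if (p == nullptr) {
            atomicAdd(failures, 1u);
        }
        slots[idx] = p;
    }
}

// Return every successful allocation to the device heap.
__global__ void releaseKernel(Payload **slots, int qty) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < qty && slots[idx] != nullptr) {
        delete slots[idx];
    }
}

// Host wrapper: prints the current heap limit and returns the number of
// in-kernel allocations that failed for the given object count.
unsigned int countFailedAllocations(int qty) {
    size_t heapBytes = 0;
    CUDA_OK(cudaDeviceGetLimit(&heapBytes, cudaLimitMallocHeapSize));
    std::printf("device heap limit: %zu bytes, requested: %zu bytes\n",
                heapBytes, static_cast<size_t>(qty) * sizeof(Payload));

    Payload **d_slots = nullptr;
    unsigned int *d_failures = nullptr;
    CUDA_OK(cudaMalloc(&d_slots, qty * sizeof(Payload *)));
    CUDA_OK(cudaMalloc(&d_failures, sizeof(unsigned int)));
    CUDA_OK(cudaMemset(d_failures, 0, sizeof(unsigned int)));

    const int threads = 256;
    const int blocks = (qty + threads - 1) / threads;
    allocateKernel<<<blocks, threads>>>(d_slots, qty, d_failures);
    CUDA_OK(cudaGetLastError());
    CUDA_OK(cudaDeviceSynchronize());

    unsigned int failures = 0;
    CUDA_OK(cudaMemcpy(&failures, d_failures, sizeof(failures),
                       cudaMemcpyDeviceToHost));

    releaseKernel<<<blocks, threads>>>(d_slots, qty);
    CUDA_OK(cudaGetLastError());
    CUDA_OK(cudaDeviceSynchronize());
    CUDA_OK(cudaFree(d_slots));
    CUDA_OK(cudaFree(d_failures));
    return failures;
}
```

Calling countFailedAllocations with a small count should report no failures, while a count whose total size is well beyond the heap limit should report a nonzero number, without any error being raised by the kernel launch itself. The exact count at which failures begin depends on allocator overhead, so it is not predicted here.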

A few suggestions:

  1. Figure out how much space you need, and reserve it ahead of time on the device heap using cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size).
  2. When you're having trouble with kernels that use new or malloc, it can be useful for debugging to use a debug macro that checks the returned pointers for NULL. This is good practice in general.
  3. You can learn how to debug an illegal memory access with more clarity (localizing it to a specific line in a specific kernel) using the method described here.
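The first two suggestions could be sketched roughly as follows. The names reserveDeviceHeap, CHECK_DEVICE_PTR, Cell, and buildCells are hypothetical, and the 2x headroom factor is an assumption to leave room for allocator overhead rather than a documented requirement. The heap limit has to be set before the first kernel that uses in-kernel new or malloc is launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: grow the device heap so it can hold qty objects of
// bytesPerObject bytes each. The factor of 2 is an assumed safety margin.
// Call this before launching any kernel that allocates on the device heap.
cudaError_t reserveDeviceHeap(size_t qty, size_t bytesPerObject) {
    const size_t wanted = 2 * qty * bytesPerObject;
    size_t current = 0;
    cudaError_t err = cudaDeviceGetLimit(&current, cudaLimitMallocHeapSize);
    if (err != cudaSuccess) {
        return err;
    }
    if (wanted <= current) {
        return cudaSuccess;  // the existing heap is already large enough
    }
    return cudaDeviceSetLimit(cudaLimitMallocHeapSize, wanted);
}

// Debug macro for device code: report a null allocation and leave the kernel
// instead of dereferencing the pointer. Only valid inside void kernels.
#define CHECK_DEVICE_PTR(ptr)                                          \
    do {                                                               \
        if ((ptr) == nullptr) {                                        \
            printf("device allocation failed at " __FILE__             \
                   ":%d (block %d, thread %d)\n",                      \
                   __LINE__, blockIdx.x, threadIdx.x);                 \
            return;                                                    \
        }                                                              \
    } while (0)

// Hypothetical simulation cell with no constructor.
struct Cell {
    int index;
    float value;
};

// Each thread creates one cell and checks the pointer before using it.
__global__ void buildCells(Cell **cells, int qty) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= qty) {
        return;
    }
    Cell *c = new Cell;
    cells[idx] = c;        // store the result even if it is null
    CHECK_DEVICE_PTR(c);   // bail out before touching a null pointer
    c->index = idx;
    c->value = 0.0f;
}
```

In the code from the question, the equivalent call would be something like reserveDeviceHeap(qty, sizeof(PlaceType)) before instantiatePlacesKernel is launched. If the objects allocate further memory in their constructors, the per-object size would need to account for that as well.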
