OpenCL: Correct results on CPU not on GPU: how to manage memory correctly?


Problem description

__kernel void CKmix(__global short* MCL, __global short* MPCL,__global short *C,  int S,  int B)
{       
    unsigned int i=get_global_id(0);
    unsigned int ii=get_global_id(1);
    MCL[i]+=MPCL[B*ii+i+C[ii]+S];
}

The kernel seems OK: it compiles successfully, and I obtained correct results using the CPU as a device. However, that was when I had the program release and recreate my memory objects on every kernel call, which for my testing purposes is about 16,000 times.

The code I am posting is where I am at now, trying to use pinned memory and mapping.

OpenCLProgram = clCreateProgramWithSource(hContext[Plat-1][Dev-1],11, OpenCLSource, NULL ,NULL);
clBuildProgram(OpenCLProgram, 0,NULL,NULL, NULL,NULL);
ocKernel = clCreateKernel(OpenCLProgram, "CKmix", NULL);

This is also successful. The reason I have a 2d array of contexts is that I iterate through all platforms and devices and allow the user to select the platform and device to use.

WorkSize[0]=SN;
WorkSize[1]=NF;  

PinnedCCL = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE| CL_MEM_ALLOC_HOST_PTR, sizeof(short) *NF, NULL, NULL);
PinnedMCL = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(short) * Z*NF, NULL, NULL);
PinnedMO =  clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(short) * Z,NULL, NULL);
PinnedMTEMP =  clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(short) * Z,NULL, NULL);

DevComboCCL = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE, sizeof(short) *NF, NULL, NULL);    
DevMappedMCL = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE , sizeof(short) * Z*NF, NULL,NULL);
DevMO =  clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE , sizeof(short) * Z,NULL, NULL);

MO = (short*) clEnqueueMapBuffer(hCmdQueue[Plat-1][Dev-1], PinnedMO, CL_TRUE, CL_MAP_READ, 0, sizeof(short)*Z, 0, NULL, NULL, NULL);
CCL = (short*) clEnqueueMapBuffer(hCmdQueue[Plat-1][Dev-1], PinnedCCL, CL_TRUE, CL_MAP_WRITE, 0, sizeof(short)*NF, 0, NULL, NULL,NULL);
MCL = (short*) clEnqueueMapBuffer(hCmdQueue[Plat-1][Dev-1], PinnedMCL, CL_TRUE, CL_MAP_WRITE, 0, sizeof(short)*Z*NF, 0, NULL, NULL, NULL);
MTEMP = (short*) clEnqueueMapBuffer(hCmdQueue[Plat-1][Dev-1], PinnedMTEMP, CL_TRUE, CL_MAP_READ, 0, sizeof(short)*Z, 0, NULL, NULL, NULL);

for (n=0; n < Z; ++n) {
    MTEMP[n]=0;
    }

clSetKernelArg(ocKernel, 0, sizeof(cl_mem), (void*) &DevMO);
clSetKernelArg(ocKernel, 1, sizeof(cl_mem), (void*) &DevMCL);    
clSetKernelArg(ocKernel, 2, sizeof(cl_mem), (void*) &DevCCL);
clSetKernelArg(ocKernel, 3, sizeof(int),    (void*) &SH);
clSetKernelArg(ocKernel, 4, sizeof(int),    (void*) &SN);

The above constitutes my initialization; the rest, below, happens repeatedly.

clEnqueueWriteBuffer(hCmdQueue[Plat-1][Dev-1], DevMCL, CL_TRUE, 0, Z*NF*sizeof(short), MCL, 0, NULL, NULL);
clEnqueueWriteBuffer(hCmdQueue[Plat-1][Dev-1], DevCCL, CL_TRUE, 0, NF*sizeof(short), CCL, 0, NULL, NULL);
clEnqueueWriteBuffer(hCmdQueue[Plat-1][Dev-1], DevMO, CL_TRUE, 0, Z*sizeof(short), MTEMP, 0, NULL, NULL);

clEnqueueNDRangeKernel(hCmdQueue[Plat-1][Dev-1], ocKernel, 2, NULL, WorkSize, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(hCmdQueue[Plat-1][Dev-1],DevMO, CL_TRUE, 0, Z * sizeof(short),(void*) MO , 0, NULL, NULL);

I have checked for errors and am not getting any. The kernel is launched repeatedly, many times, with fresh data. I am not sure where I am going wrong.

NVIDIA 550 Ti, compute capability 2.1, latest dev driver, CUDA SDK 4.0.

Answer

I don't know if it's the only problem with the code, but this:

unsigned int i=get_global_id(0);
unsigned int ii=get_global_id(1);
MCL[i]+=MPCL[B*ii+i+C[ii]+S];

is definitely not a good idea. You will generally have multiple work-items with the same global_id(0), so several threads may try to update MCL[i] simultaneously (note that += is not atomic). I would assume that on the CPU not enough threads are generated to expose this behaviour in most cases, while the thousands of threads on the GPU will almost surely lead to problems.

The most reasonable way to do this is to use a one-dimensional work size and have each work-item accumulate all values that go to one position:

unsigned int i=get_global_id(0);
short accum = MCL[i]; // or 0, if that's the start
for(int ii = 0; ii < size; ++ii)
  accum += MPCL[B*ii+i+C[ii]+S];
MCL[i] = accum;

Of course, that may or may not be feasible. If it isn't, the fix probably won't be quite that simple.

