OpenCL: Correct results on CPU but not on GPU: how to manage memory correctly?
Problem description
__kernel void CKmix(__global short* MCL, __global short* MPCL, __global short* C, int S, int B)
{
    unsigned int i  = get_global_id(0);
    unsigned int ii = get_global_id(1);
    MCL[i] += MPCL[B*ii + i + C[ii] + S];
}
The kernel seems OK: it compiles successfully, and I obtained correct results using the CPU as a device, but that was when I had the program release and recreate my memory objects each time the kernel is called, which for my testing purposes is about 16,000 times.
The code I am posting is where I am at now, trying to use pinned memory and mapping.
OpenCLProgram = clCreateProgramWithSource(hContext[Plat-1][Dev-1], 11, OpenCLSource, NULL, NULL);
clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
ocKernel = clCreateKernel(OpenCLProgram, "CKmix", NULL);
This is also successful. The reason I have a 2D array of contexts is that I iterate through all platforms and devices and allow the user to select which platform and device to use.
WorkSize[0] = SN;
WorkSize[1] = NF;

PinnedCCL   = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(short) * NF,   NULL, NULL);
PinnedMCL   = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(short) * Z*NF, NULL, NULL);
PinnedMO    = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(short) * Z,    NULL, NULL);
PinnedMTEMP = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(short) * Z,    NULL, NULL);

DevComboCCL  = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE, sizeof(short) * NF,   NULL, NULL);
DevMappedMCL = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE, sizeof(short) * Z*NF, NULL, NULL);
DevMO        = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE, sizeof(short) * Z,    NULL, NULL);

MO    = (short*) clEnqueueMapBuffer(hCmdQueue[Plat-1][Dev-1], PinnedMO,    CL_TRUE, CL_MAP_READ,  0, sizeof(short)*Z,    0, NULL, NULL, NULL);
CCL   = (short*) clEnqueueMapBuffer(hCmdQueue[Plat-1][Dev-1], PinnedCCL,   CL_TRUE, CL_MAP_WRITE, 0, sizeof(short)*NF,   0, NULL, NULL, NULL);
MCL   = (short*) clEnqueueMapBuffer(hCmdQueue[Plat-1][Dev-1], PinnedMCL,   CL_TRUE, CL_MAP_WRITE, 0, sizeof(short)*Z*NF, 0, NULL, NULL, NULL);
MTEMP = (short*) clEnqueueMapBuffer(hCmdQueue[Plat-1][Dev-1], PinnedMTEMP, CL_TRUE, CL_MAP_READ,  0, sizeof(short)*Z,    0, NULL, NULL, NULL);

for (n = 0; n < Z; ++n) {
    MTEMP[n] = 0;
}

clSetKernelArg(ocKernel, 0, sizeof(cl_mem), (void*) &DevMO);
clSetKernelArg(ocKernel, 1, sizeof(cl_mem), (void*) &DevMCL);
clSetKernelArg(ocKernel, 2, sizeof(cl_mem), (void*) &DevCCL);
clSetKernelArg(ocKernel, 3, sizeof(int), (void*) &SH);
clSetKernelArg(ocKernel, 4, sizeof(int), (void*) &SN);
The above constitutes my initialization; the rest below happens repeatedly.
clEnqueueWriteBuffer(hCmdQueue[Plat-1][Dev-1], DevMCL, CL_TRUE, 0, Z*NF*sizeof(short), MCL,   0, NULL, NULL);
clEnqueueWriteBuffer(hCmdQueue[Plat-1][Dev-1], DevCCL, CL_TRUE, 0, NF*sizeof(short),   CCL,   0, NULL, NULL);
clEnqueueWriteBuffer(hCmdQueue[Plat-1][Dev-1], DevMO,  CL_TRUE, 0, Z*sizeof(short),    MTEMP, 0, NULL, NULL);
clEnqueueNDRangeKernel(hCmdQueue[Plat-1][Dev-1], ocKernel, 2, NULL, WorkSize, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(hCmdQueue[Plat-1][Dev-1], DevMO, CL_TRUE, 0, Z*sizeof(short), (void*) MO, 0, NULL, NULL);
I have checked for errors and am not getting any. The kernel is launched repeatedly, many times, with fresh data. I am not sure where I am going wrong.
NVIDIA 550 Ti (compute capability 2.1), latest dev driver, CUDA SDK 4.0.
Answer
I don't know if it's the only problem with the code, but this:
unsigned int i  = get_global_id(0);
unsigned int ii = get_global_id(1);
MCL[i] += MPCL[B*ii + i + C[ii] + S];
is definitely not a good idea. You will generally get multiple work-items with the same global_id(0), so several threads may try to update MCL[i] simultaneously (note that += is not atomic). I would assume that on the CPU not enough threads are generated to expose this behaviour in most cases, while the thousands of threads on the GPU will almost certainly lead to problems.
The most reasonable way to do this is to use a one-dimensional work set and have each thread accumulate all values that go to one position:
unsigned int i = get_global_id(0);
short accum = MCL[i]; // or 0, if that's the start
for (int ii = 0; ii < size; ++ii)
    accum += MPCL[B*ii + i + C[ii] + S];
MCL[i] = accum;
Of course, that may or may not be feasible; if it isn't, the fix probably won't be quite that simple.