OpenCL: Correct results on CPU but not on GPU: how to manage memory correctly?
Problem description
__kernel void CKmix(__global short* MCL, __global short* MPCL, __global short* C, int S, int B)
{
    unsigned int i  = get_global_id(0);
    unsigned int ii = get_global_id(1);
    MCL[i] += MPCL[B*ii + i + C[ii] + S];
}
The kernel seems OK: it compiles successfully, and I obtained correct results using the CPU as a device, but that was when I had the program release and recreate my memory objects each time the kernel is called, which for my testing purposes is about 16,000 times.
The code I am posting is where I am at now, trying to use pinned memory and mapping.
OpenCLProgram = clCreateProgramWithSource(hContext[Plat-1][Dev-1], 11, OpenCLSource, NULL, NULL);
clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
ocKernel = clCreateKernel(OpenCLProgram, "CKmix", NULL);
This is also successful. The reason I have a 2D array of contexts is that I iterate through all platforms and devices and allow the user to select which platform and device to use.
WorkSize[0] = SN;
WorkSize[1] = NF;

PinnedCCL   = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(short) * NF,   NULL, NULL);
PinnedMCL   = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(short) * Z*NF, NULL, NULL);
PinnedMO    = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(short) * Z,    NULL, NULL);
PinnedMTEMP = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(short) * Z,    NULL, NULL);

DevComboCCL  = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE, sizeof(short) * NF,   NULL, NULL);
DevMappedMCL = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE, sizeof(short) * Z*NF, NULL, NULL);
DevMO        = clCreateBuffer(hContext[Plat-1][Dev-1], CL_MEM_READ_WRITE, sizeof(short) * Z,    NULL, NULL);

MO    = (short*) clEnqueueMapBuffer(hCmdQueue[Plat-1][Dev-1], PinnedMO,    CL_TRUE, CL_MAP_READ,  0, sizeof(short)*Z,    0, NULL, NULL, NULL);
CCL   = (short*) clEnqueueMapBuffer(hCmdQueue[Plat-1][Dev-1], PinnedCCL,   CL_TRUE, CL_MAP_WRITE, 0, sizeof(short)*NF,   0, NULL, NULL, NULL);
MCL   = (short*) clEnqueueMapBuffer(hCmdQueue[Plat-1][Dev-1], PinnedMCL,   CL_TRUE, CL_MAP_WRITE, 0, sizeof(short)*Z*NF, 0, NULL, NULL, NULL);
MTEMP = (short*) clEnqueueMapBuffer(hCmdQueue[Plat-1][Dev-1], PinnedMTEMP, CL_TRUE, CL_MAP_READ,  0, sizeof(short)*Z,    0, NULL, NULL, NULL);

for (n = 0; n < Z; ++n) {
    MTEMP[n] = 0;
}

clSetKernelArg(ocKernel, 0, sizeof(cl_mem), (void*) &DevMO);
clSetKernelArg(ocKernel, 1, sizeof(cl_mem), (void*) &DevMCL);
clSetKernelArg(ocKernel, 2, sizeof(cl_mem), (void*) &DevCCL);
clSetKernelArg(ocKernel, 3, sizeof(int), (void*) &SH);
clSetKernelArg(ocKernel, 4, sizeof(int), (void*) &SN);
The above constitutes my initialization; the rest below happens repeatedly.
clEnqueueWriteBuffer(hCmdQueue[Plat-1][Dev-1], DevMCL, CL_TRUE, 0, Z*NF*sizeof(short), MCL,   0, NULL, NULL);
clEnqueueWriteBuffer(hCmdQueue[Plat-1][Dev-1], DevCCL, CL_TRUE, 0, NF*sizeof(short),   CCL,   0, NULL, NULL);
clEnqueueWriteBuffer(hCmdQueue[Plat-1][Dev-1], DevMO,  CL_TRUE, 0, Z*sizeof(short),    MTEMP, 0, NULL, NULL);
clEnqueueNDRangeKernel(hCmdQueue[Plat-1][Dev-1], ocKernel, 2, NULL, WorkSize, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(hCmdQueue[Plat-1][Dev-1], DevMO, CL_TRUE, 0, Z*sizeof(short), (void*) MO, 0, NULL, NULL);
I have checked for errors and am not getting any. The kernel is launched repeatedly, many times, with fresh data. I am not sure where I am going wrong.
NVIDIA 550 Ti (compute capability 2.1), latest dev driver, CUDA SDK 4.0.
Answer
I don't know if it's the only problem with the code, but this:
unsigned int i  = get_global_id(0);
unsigned int ii = get_global_id(1);
MCL[i] += MPCL[B*ii + i + C[ii] + S];
is definitely not a good idea. You will generally get multiple work-items with the same global_id(0), so several threads may try to update MCL[i] simultaneously (note that += is not atomic). I would assume that on the CPU not enough threads are generated to expose this behaviour in most cases, while the thousands of threads on the GPU will almost certainly lead to problems.
The most reasonable way to do this is to use a one-dimensional work set and have each thread accumulate all values that go to one position:
unsigned int i = get_global_id(0);
short accum = MCL[i]; // or 0, if that's the start
for (int ii = 0; ii < size; ++ii)
    accum += MPCL[B*ii + i + C[ii] + S];
MCL[i] = accum;
Of course, that may or may not be feasible; if it isn't, the fix probably won't be quite that simple.