OpenCL Shared Memory Among Tasks


Question

I've been working on a GPU-based Conway's Game of Life program. If you're not familiar with it, see the Wikipedia page. I created one version that works by keeping an array of values where 0 represents a dead cell and 1 a live one. The kernel simply writes to an image buffer data array to draw an image based on the cell data, and then checks each cell's neighbors to update the cell array for the next execution to render.

However, a faster method instead represents the value of a cell as a negative number if it is dead and a positive number if it is alive. The magnitude of that number is the number of neighbors the cell has plus one (which makes zero an impossible value, since we cannot differentiate 0 from -0). This means that when spawning or killing a cell we must update the values of its eight neighbors accordingly. Thus, unlike the working procedure, which only has to read from the neighboring memory slots, this procedure must write to those slots. Doing so gives inconsistent results, and the output array is not valid. For example, cells contain numbers such as 14, which would indicate 13 neighbors, an impossible value. I believe the logic is correct, because I wrote the same procedure on the CPU and it works as expected. After testing, I suspect that when tasks try to write to the same memory at the same time, there is a delay that leads to a write error of some kind. For example, perhaps there is a delay between reading the array data and setting it, during which the data is changed, making another task's procedure incorrect. I've tried using semaphores and barriers, but I have only just learned OpenCL and parallel processing and don't fully grasp them yet. The kernel is as follows.

int wrap(int val, int limit){
    int response = val;
    if(response<0){response+=limit;}
    if(response>=limit){response-=limit;}
    return response;
}

__kernel void optimizedModel(
        __global uint *output,
        int sizeX, int sizeY,
        __global uint *colorMap,
        __global uint *newCellMap,
        __global uint *historyBuffer
)
{
    // the x and y coordinates that are currently being computed
    unsigned int x = get_global_id(0);
    unsigned int y = get_global_id(1);

    int cellValue = historyBuffer[sizeX*y+x];
    int neighborCount = abs(cellValue)-1;
    output[y*sizeX+x] = colorMap[cellValue > 0 ? 1 : 0];

    if(cellValue > 0){// if alive
        if(neighborCount < 2 || neighborCount > 3){
            // kill

            for(int i=-1; i<2; i++){
                for(int j=-1; j<2; j++){
                    if(i!=0 || j!=0){
                        int wxc = wrap(x+i, sizeX);
                        int wyc = wrap(y+j, sizeY);
                        newCellMap[sizeX*wyc+wxc] -= newCellMap[sizeX*wyc+wxc] > 0 ? 1 : -1;
                    }
                }
            }
            newCellMap[sizeX*y+x] *= -1;

            // end kill
        }
    }else{
        if(neighborCount==3){
            // spawn

            for(int i=-1; i<2; i++){
                for(int j=-1; j<2; j++){
                    if(i!=0 || j!=0){
                        int wxc = wrap(x+i, sizeX);
                        int wyc = wrap(y+j, sizeY);
                        newCellMap[sizeX*wyc+wxc] += newCellMap[sizeX*wyc+wxc] > 0 ? 1 : -1;
                    }
                }
            }
            newCellMap[sizeX*y+x] *= -1;

            // end spawn
        }
    }
}


  1. The array output is the image buffer data used to render the kernel's computation.
  2. The sizeX and sizeY constants are the width and height of the image buffer respectively.
  3. The colorMap array contains the RGB integer values for black and white respectively, which are used to set the image buffer's values so that the colors render properly.
  4. The newCellMap array is the updated cell map being calculated once rendering is determined.
  5. The historyBuffer is the old state of the cells at the beginning of the kernel call. Every time the kernel is executed, this array is updated to the newCellMap array.

Additionally, the wrap function makes the space toroidal. How can I fix this code so that it works as expected? And why doesn't the global memory update with each change made by a task? Isn't it supposed to be shared memory?

Recommended answer

As sharpneli said in his answer, you are reading and writing the same memory regions from different threads, and that gives undefined behaviour.
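To make the failure mode concrete, here is a minimal sketch in plain C (not OpenCL) that replays one possible interleaving by hand. It assumes that a non-atomic update such as `cell += 1` is carried out as a separate load and store; the names `WorkItem`, `load`, `store_incremented`, `interleaved_increments` and `serialized_increments` are made up for this illustration and do not come from the question's code.

```c
/* The private copy a work-item holds between its load and its store. */
typedef struct {
    int loaded;
} WorkItem;

/* First half of a non-atomic `cell += 1`: read the shared value. */
static void load(WorkItem *w, const int *cell) {
    w->loaded = *cell;
}

/* Second half: write back the (possibly stale) value plus one. */
static void store_incremented(const WorkItem *w, int *cell) {
    *cell = w->loaded + 1;
}

/* Both work-items load before either stores, so one increment is lost. */
int interleaved_increments(int start) {
    int cell = start;
    WorkItem a, b;
    load(&a, &cell);
    load(&b, &cell);
    store_incremented(&a, &cell);
    store_incremented(&b, &cell);
    return cell;
}

/* Each work-item finishes before the next one starts: both increments land. */
int serialized_increments(int start) {
    int cell = start;
    WorkItem a, b;
    load(&a, &cell);
    store_incremented(&a, &cell);
    load(&b, &cell);
    store_incremented(&b, &cell);
    return cell;
}
```

Starting from the same value, the interleaved version ends one lower than the serialized one. On a GPU the order is not under your control, so neighbor counts updated this way can drift to values that should be impossible.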

Solution: You need to split your newCellMap into two arrays, one holding the values from the previous execution and one where the new values will be stored. Then, you need to change the kernel arguments from the host side on each call, so that the oldvalues of the next iteration are the newvalues of the previous iteration. Because of how your algorithm is structured, you will also need to copy oldvalues into newvalues (for example with clEnqueueCopyBuffer()) before you run it.

__kernel void optimizedModel(
        __global uint *output,
        int sizeX, int sizeY,
        __global uint *colorMap,
        __global uint *oldCellMap,
        __global uint *newCellMap,
        __global uint *historyBuffer
)
{
    // the x and y coordinates that are currently being computed
    unsigned int x = get_global_id(0);
    unsigned int y = get_global_id(1);

    int cellValue = historyBuffer[sizeX*y+x];
    int neighborCount = abs(cellValue)-1;
    output[y*sizeX+x] = colorMap[cellValue > 0 ? 1 : 0];

    if(cellValue > 0){// if alive
        if(neighborCount < 2 || neighborCount > 3){
            // kill

            for(int i=-1; i<2; i++){
                for(int j=-1; j<2; j++){
                    if(i!=0 || j!=0){
                        int wxc = wrap(x+i, sizeX);
                        int wyc = wrap(y+j, sizeY);
                        newCellMap[sizeX*wyc+wxc] -= oldCellMap[sizeX*wyc+wxc] > 0 ? 1 : -1;
                    }
                }
            }
            newCellMap[sizeX*y+x] *= -1;

            // end kill
        }
    }else{
        if(neighborCount==3){
            // spawn

            for(int i=-1; i<2; i++){
                for(int j=-1; j<2; j++){
                    if(i!=0 || j!=0){
                        int wxc = wrap(x+i, sizeX);
                        int wyc = wrap(y+j, sizeY);
                        newCellMap[sizeX*wyc+wxc] += oldCellMap[sizeX*wyc+wxc] > 0 ? 1 : -1;
                    }
                }
            }
            newCellMap[sizeX*y+x] *= -1;

            // end spawn
        }
    }
}
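The two-buffer idea can also be tried out on the CPU. The following is a minimal sketch in plain C, under the assumption that the simpler 0/1 cell representation from the first version of the program is used, so that each cell reads only from the old buffer and writes only its own slot in the new buffer. The names `wrap_index`, `life_step` and `life_run` are hypothetical. On the OpenCL host side, the pointer swap would correspond to exchanging the two buffer handles passed as kernel arguments between calls.

```c
/* Wrap an index onto a torus of the given size (same idea as wrap()). */
static int wrap_index(int val, int limit) {
    if (val < 0) val += limit;
    if (val >= limit) val -= limit;
    return val;
}

/* Compute one generation: reads only old_cells, writes only new_cells. */
void life_step(const unsigned char *old_cells, unsigned char *new_cells,
               int size_x, int size_y) {
    for (int y = 0; y < size_y; y++) {
        for (int x = 0; x < size_x; x++) {
            int neighbors = 0;
            for (int j = -1; j <= 1; j++) {
                for (int i = -1; i <= 1; i++) {
                    if (i != 0 || j != 0) {
                        int wx = wrap_index(x + i, size_x);
                        int wy = wrap_index(y + j, size_y);
                        neighbors += old_cells[size_x * wy + wx];
                    }
                }
            }
            unsigned char alive = old_cells[size_x * y + x];
            new_cells[size_x * y + x] =
                (unsigned char)(neighbors == 3 || (alive && neighbors == 2));
        }
    }
}

/* Run several generations, swapping the roles of the two buffers after
   each step. Returns the buffer that holds the final state. */
unsigned char *life_run(unsigned char *buf_a, unsigned char *buf_b,
                        int size_x, int size_y, int generations) {
    unsigned char *old_cells = buf_a;
    unsigned char *new_cells = buf_b;
    for (int g = 0; g < generations; g++) {
        life_step(old_cells, new_cells, size_x, size_y);
        unsigned char *tmp = old_cells;
        old_cells = new_cells;
        new_cells = tmp;
    }
    return old_cells;
}
```

Because no cell ever writes to a slot that another cell reads during the same step, the result does not depend on the order in which cells are processed, which is the property a parallel kernel needs.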



Your question about shared memory has a simple answer: OpenCL does not have shared memory across HOST-DEVICE.

When you create a memory buffer for the device, you first have to initialize that memory region with clEnqueueWriteBuffer() and then read it back with clEnqueueReadBuffer() to get the results. Even if you do have a pointer to the memory region, that pointer refers to the host-side copy of the region, which is unlikely to hold the latest device-computed output.

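P.S.: I wrote a Game of Life in OpenCL a long time ago, and I found that the simplest and fastest way to do it is to create one big 2D array of bits (bit addressing). Then write a piece of code without any branches that simply analyzes the neighbours and computes the updated value for that cell. Since bit addressing is used, the amount of memory read and written by each thread is considerably lower than when addressing chars, ints or other types. I achieved 33 Mcells/sec on very old OpenCL hardware (nVIDIA 9100M G). This is just to let you know that your if/else approach is probably not the most efficient one.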
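As a rough illustration of that idea, here is a minimal sketch in plain C. It is not the code behind the figure quoted above. It assumes a layout of one bit per cell with 32 cells per `uint32_t` word, in row-major order, and the names `get_bit`, `next_state` and `next_cell` are hypothetical. The rule is written without if/else by combining comparison results with bitwise operators.

```c
#include <stdint.h>

/* Read cell (x, y) from a grid packed one bit per cell, 32 cells per word. */
static unsigned get_bit(const uint32_t *grid, int size_x, int x, int y) {
    int idx = size_x * y + x;
    return (grid[idx >> 5] >> (idx & 31)) & 1u;
}

/* Life rule without if/else: a cell is alive next generation if it has
   exactly 3 live neighbors, or if it is alive and has exactly 2. */
unsigned next_state(unsigned alive, unsigned neighbors) {
    return (unsigned)(neighbors == 3) | (alive & (unsigned)(neighbors == 2));
}

/* Next state of cell (x, y) on a toroidal grid stored as packed bits. */
unsigned next_cell(const uint32_t *grid, int size_x, int size_y, int x, int y) {
    int xl = (x + size_x - 1) % size_x;
    int xr = (x + 1) % size_x;
    int yu = (y + size_y - 1) % size_y;
    int yd = (y + 1) % size_y;
    unsigned n = get_bit(grid, size_x, xl, yu) + get_bit(grid, size_x, x, yu)
               + get_bit(grid, size_x, xr, yu) + get_bit(grid, size_x, xl, y)
               + get_bit(grid, size_x, xr, y)  + get_bit(grid, size_x, xl, yd)
               + get_bit(grid, size_x, x, yd)  + get_bit(grid, size_x, xr, yd);
    return next_state(get_bit(grid, size_x, x, y), n);
}
```

A real kernel would likely process a whole word of cells per work-item rather than one bit at a time; this sketch only shows the packing and the branch-free rule.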
