Opencl barriers not working


Problem description


I tried the posted code.

My idea was to obtain partial sums of the input data in the array rms, then use barriers (GLOBAL and LOCAL) to wait until all rms[k] are filled, and finally sum them all to obtain the mean value.

I placed some printf calls to warn if there are errors in the calculation.

I got errors at printf Warning-2 but not at Warning-1 and 3: when adding all the data, some of the cores had still not finished calculating their partial sums.

I did not use local memory, since the maximum local size is 256, which is much smaller than height=10000.

How do I make the GPU wait until all the partial sums have been calculated?

What I have tried:

I have the following code:

__kernel void hallaRMS2(
	__global float  *data, // size = WIDTH*HEIGHT
	int WIDTH,
	int HEIGHT,
	__global double *rms   // size = HEIGHT
)
{
	int k = get_global_id(0); // 0..HEIGHT-1
	__global float *x = data + k*WIDTH;

	double sum = 0.0;
	for (int j = 0; j < WIDTH; j++)
	{
		sum += x[j];
	}

	rms[k] = sum; // to be used to calculate the mean
	if ((rms[k] < 100*WIDTH) || (rms[k] > 101*WIDTH)) printf("Warning-1: rms[%i]=%lg\n", k, rms[k]);

	barrier(CLK_GLOBAL_MEM_FENCE); // to give time for all rms[k] to be filled
	barrier(CLK_LOCAL_MEM_FENCE);

	if (k == 0)
	{
		sum = 0.0;
		for (int j = 0; j < HEIGHT; j++)
		{
			if ((rms[j] < 100*WIDTH) || (rms[j] > 101*WIDTH)) printf("Warning-2: rms[%i]=%lg\n", j, rms[j]);
			sum += rms[j];
		}
		rms[0] = sum / (double)WIDTH / (double)HEIGHT;
		printf("GPU sum=%lg\n", sum);
		printf("GPU media=%lg\n", rms[0]);
	}
	else
		if ((rms[k] < 100*WIDTH) || (rms[k] > 101*WIDTH)) printf("Warning-3: rms[%i]=%lg\n", k, rms[k]);
...

Solution

I am not very happy with this solution:

OpenCL 1.2 does not allow synchronization across all work-groups, as I stated (barrier() only synchronizes work-items within one work-group), so you must exit the kernel and enter a new one in order to use data from all work-items.
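The two-kernel split could be sketched like this (a minimal sketch, not the author's code: hallaRMS2 is reduced to a partial-sum pass, and a hypothetical second kernel meanOfSums is enqueued by the host afterwards, so that kernel-boundary ordering on an in-order queue, or an event wait list, guarantees all rms[k] are visible):

```c
// Pass 1: each work-item computes one row's partial sum.
__kernel void partialSums(
	__global const float *data, // size = WIDTH*HEIGHT
	int WIDTH,
	__global double *rms)       // size = HEIGHT
{
	int k = get_global_id(0);
	__global const float *x = data + k*WIDTH;
	double sum = 0.0;
	for (int j = 0; j < WIDTH; j++)
		sum += x[j];
	rms[k] = sum;
}

// Pass 2: enqueued AFTER pass 1 completes, so every rms[k]
// is already written when this kernel starts.
__kernel void meanOfSums(
	int WIDTH,
	int HEIGHT,
	__global const double *rms,
	__global double *mean)
{
	if (get_global_id(0) == 0)
	{
		double sum = 0.0;
		for (int j = 0; j < HEIGHT; j++)
			sum += rms[j];
		*mean = sum / (double)WIDTH / (double)HEIGHT;
	}
}
```

On the host side the two clEnqueueNDRangeKernel calls on the same in-order command queue are enough; out-of-order queues need an event from pass 1 in pass 2's wait list.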

If somebody knows how to do it in the new OpenCL 2.x standard, I would appreciate it.

Fortunately for the CUDA folks, CUDA allows synchronizing across the whole device without leaving the kernel. This must be taken into account if somebody tries to translate code from CUDA to OpenCL!
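For comparison, that device-wide barrier in CUDA could look like the sketch below (assuming CUDA 9+ cooperative groups; the kernel must be launched with cudaLaunchCooperativeKernel on a device that supports cooperative launch, and the whole grid must be resident at once):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void rowMeans(const float *data, int width, int height, double *rms)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < height) {
        double sum = 0.0;
        for (int j = 0; j < width; j++)
            sum += data[(size_t)k * width + j];
        rms[k] = sum; // partial sum for row k
    }

    // Device-wide barrier: every thread in the grid waits here,
    // so all rms[k] are written before the reduction below.
    cg::this_grid().sync();

    if (k == 0) {
        double sum = 0.0;
        for (int j = 0; j < height; j++)
            sum += rms[j];
        rms[0] = sum / (double)width / (double)height;
    }
}
```

Note that grid.sync() silently requires the cooperative launch API; an ordinary `<<<...>>>` launch does not provide the grid-wide guarantee.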


