Best approach to FIFO implementation in an OpenCL kernel


Question

Goal: Implement the diagram shown below in OpenCL. The main thing needed from the OpenCL kernel is to multiply the coefficient array by the temp array and then accumulate all those products into a single value at the end. (That is probably the most time-intensive operation, so parallelism would be really helpful here.)

I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will run in parallel as well).

Description of the picture:

One at a time, the values are passed into an array (the temp array) that is the same size as the coefficient array. Every time a single value is passed into this array, the temp array is multiplied with the coefficient array in parallel, and the products at each index are then summed into one single element. This continues until the input array reaches its final element.
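The description above can be sketched as a serial, CPU-side reference in plain C, which may be handy for checking GPU results. This is only an illustration of the described behaviour, not the asker's code: the names `fir_fifo`, `fir_fifo_reset`, `fir_fifo_push`, and `NUM_TAPS` are hypothetical, and 58 is taken from the coefficient count mentioned in the question.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TAPS 58  /* assumed: matches the 58 coefficients in the question */

/* Hypothetical delay line; taps[0] always holds the newest sample. */
typedef struct {
    float taps[NUM_TAPS];
} fir_fifo;

/* Clear the delay line so the first outputs see zeros for missing history. */
void fir_fifo_reset(fir_fifo *f) {
    assert(f != NULL);
    for (size_t i = 0; i < NUM_TAPS; i++) {
        f->taps[i] = 0.0f;
    }
}

/* Push one input sample, then return the dot product of the delay line
 * with the coefficient array (one output value per pushed input). */
float fir_fifo_push(fir_fifo *f, const float *coefficients, float input) {
    assert(f != NULL && coefficients != NULL);

    /* Shift from the back so no element is overwritten before it is moved. */
    for (size_t i = NUM_TAPS - 1; i > 0; i--) {
        f->taps[i] = f->taps[i - 1];
    }
    f->taps[0] = input;

    float sum = 0.0f;
    for (size_t i = 0; i < NUM_TAPS; i++) {
        sum += coefficients[i] * f->taps[i];
    }
    return sum;
}
```

Calling `fir_fifo_reset` once and then `fir_fifo_push` for each input in order yields one output per input. Every call depends on the state left by the previous one, which is why this formulation is inherently sequential.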

[Image: diagram, not included]

My code and what happens

For 60 elements from the input, it takes over 8000 ms, and I have a total of 1.2 million inputs that still have to be passed in. I know for a fact that there is a much better way to do what I am attempting. My code is below.

Here are some things that I know for sure are wrong with the code. When I try to multiply the coefficient values with the temp array, it crashes. This is because of the global_id. All I want this line to do is multiply the two arrays in parallel.

I tried to figure out why the FIFO function was taking so long, so I started commenting lines out. I first commented out everything except the first for loop of the FIFO function; that version took 50 ms. When I then uncommented the next loop, the time jumped to 8000 ms. So the delay must have to do with the transfer of data.

Is there a register shift that I could use in OpenCL? Perhaps some logical shifting method for integer arrays? (I know there is a '>>' operator.)

float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];

float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {

//take array of 58 elements (or same size as number of coefficients)
//shift all elements to the right one
//bring next element into index 0 from input
//multiply the coefficient array with the array that's the same size as the coefficients and accumulate
//store into one output value of the output array
//repeat till input array has reached the end

int globalId = get_global_id(0); 

float output = 0.0f;

//Shift everything down from 1 to 57
//takes about 50ms here
for(int i=1; i<58; i++){
    tempArrayForShift[i] = temp[i];
}

//Input the new value passed from main kernel. Rest of values were shifted over so element is written at index 0.
tempArrayForShift[0] = inputValue;
//Takes about 8000ms with this loop included
//Write values back into temp array
for(int i=0; i<58; i++){
    temp[i] = tempArrayForShift[i];
}

//all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
//I am 100% sure this line is crashing the program.
//multipliedResult[globalId] = coefficients[globalId] * temp[globalId];

//Sum the temp array with each other. Temp array consists of coefficients*fifo buffer
for (int i = 0; i <  58; i ++) {
//  output = multipliedResult[i] + output;
}

//Returned summed value of temp array
return output;
}


__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) { 

//Initialize the temporary array values to 0
for (int i = 0; i <  58; i ++) {
    temp[i] = 0;
    tempArrayForShift[i] = 0;
    multipliedResult[i] = 0;
}

//fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
for (int i = 0; i <  60; i ++) {
    Output[i] = fifo(Array[i], coefficients, 58);
}

}

I have had this problem with OpenCL for a long time. I am not sure how to combine parallel and sequential instructions.

Another option I considered

In the main cpp file, I was thinking of implementing the FIFO buffer there and having the kernel do only the multiplication and addition. But this would mean calling the kernel 1000+ times in a loop. Would this be the better solution, or would it just be completely inefficient?

Recommended answer

To get good performance out of a GPU, you need to parallelize your work across many threads. Your code uses just a single thread, and a GPU is very slow per thread; it only becomes fast when many threads run at the same time. In this case you can use one thread for each output value. You do not actually need to shift values through an array: each output value depends on a window of 58 input values, so each thread can just read those values from memory, multiply them with the coefficients, and write back the summed result.

A simple implementation would be (launch with as many threads as there are output values):

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) 
{ 
    // One work-item per output value.
    int globalId = get_global_id(0); 
    float sum=0.0f;
    for (int i=0; i< 58; i++)
    {
        float tmp=0;
        // Skip window positions that fall before the start of the input.
        if (globalId+i > 56)
        {
            tmp=Array[i+globalId-57]*coefficients[57-i];
        }
        sum += tmp;
    }
    Output[globalId]=sum;
}
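To check the kernel's output on the host, a plain C mirror of the same per-output loop can be useful. The sketch below is an assumption-laden illustration rather than part of the original answer: `fir_direct` and its parameter names are hypothetical, and samples before the start of the input are treated as zero, as in the kernel's bounds check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference:
 *   out[n] = sum over k of coeff[k] * in[n - k]
 * Samples before the start of the input count as zero. */
void fir_direct(const float *in, size_t num_samples,
                const float *coeff, size_t num_taps, float *out) {
    assert(in != NULL && coeff != NULL && out != NULL);

    /* Each iteration of the outer loop corresponds to one GPU work-item. */
    for (size_t n = 0; n < num_samples; n++) {
        float sum = 0.0f;
        /* k <= n keeps the index n - k from going below zero. */
        for (size_t k = 0; k < num_taps && k <= n; k++) {
            sum += coeff[k] * in[n - k];
        }
        out[n] = sum;
    }
}
```

Because no iteration of the outer loop depends on another, the outputs can be computed in any order, which is what makes the one-thread-per-output mapping possible. When comparing against GPU results, a small tolerance is advisable, since the kernel adds the products in a different order and floating-point sums can differ slightly.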

This is not perfect, as the memory access pattern it generates is not optimal for GPUs. The cache will likely help a bit, but there is clearly a lot of room for optimization, since each input value is reused several times. The operation you are trying to perform is called a (1D) convolution. NVIDIA has a 2D example called oclConvolutionSeparable in their GPU Computing SDK that shows an optimized version. You can adapt their convolutionRows kernel for a 1D convolution.
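The reuse mentioned above can be illustrated on the CPU with a tiled loop: each block of outputs first stages the inputs it needs (the tile plus the preceding samples) into a small buffer and then reads only from that buffer. This is a rough sketch of the general idea, not NVIDIA's convolutionRows kernel; `fir_tiled`, `TILE_SIZE`, and `MAX_TAPS` are hypothetical names, and on a GPU the staging buffer would typically live in local memory shared by a work-group.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_SIZE 64  /* assumed number of outputs per tile */
#define MAX_TAPS  64  /* assumed upper bound on the filter length */

/* Hypothetical tiled FIR: same result as computing
 *   out[n] = sum over k of coeff[k] * in[n - k]
 * directly, but every input is fetched from `in` once per tile. */
void fir_tiled(const float *in, size_t num_samples,
               const float *coeff, size_t num_taps, float *out) {
    assert(in != NULL && coeff != NULL && out != NULL);
    assert(num_taps > 0 && num_taps <= MAX_TAPS);

    float tile[TILE_SIZE + MAX_TAPS - 1];
    const size_t halo = num_taps - 1;  /* older samples needed before the tile */

    for (size_t base = 0; base < num_samples; base += TILE_SIZE) {
        /* Stage the tile and its halo; tile[j] holds in[base + j - halo],
         * or zero when that index falls outside the input. */
        for (size_t j = 0; j < TILE_SIZE + halo; j++) {
            size_t shifted = base + j;
            int in_range = shifted >= halo && shifted - halo < num_samples;
            tile[j] = in_range ? in[shifted - halo] : 0.0f;
        }
        /* Compute this tile's outputs from the staged copy only. */
        for (size_t t = 0; t < TILE_SIZE && base + t < num_samples; t++) {
            float sum = 0.0f;
            for (size_t k = 0; k < num_taps; k++) {
                sum += coeff[k] * tile[t + halo - k];
            }
            out[base + t] = sum;
        }
    }
}
```

With 58 taps and 64 outputs per tile, each tile stages 121 inputs instead of reading 58 inputs for each of its 64 outputs, which is the kind of saving the optimized SDK example aims for.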
