在实际输入数据高效的2D FFT? [英] Efficient 2D FFT on real input data?

查看:165
本文介绍了在实际输入数据高效的2D FFT?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在实施一个二维FFT真正的输入数据使用的OpenCL(更具体使用的FFT快速二维卷积,所以我只需要的东西,其行为足以类似地应用卷积)。二维FFT是在COLS使用一维FFT的行,之后的一维FFT来实现。

I'm currently implementing a two dimensional FFT for real input data using opencl (more specifically a fast 2D convolution using FFTs, so I only need something which behaves similary enough to apply the convolution to). The 2D FFT is implemented using an 1D FFT on the rows and afterwards an 1D FFT on the cols.

为了使这更有效,我尝试使用的FFT与实际输入的对称性,以便能够计算更小的FFT。我发现我可以将两排为一体,采用先为实部,第二个是虚部,做的第一个一维FFT结果行,然后使用对称特性来构建个人的维FFT结果行从。所以我在做什么,基本上如下:

To make this more efficient I'm trying to use the symmetries of FFTs with real input in order to be able to calculate smaller FFTs. I found that I can combine two rows into one, using the first as real component, the second as imaginary component, do the first 1D FFT on the resulting row and then use the symmetry properties to construct the results of the 1D FFTs of the individual rows from that. So what I'm doing is basically the following:

F 先按g 从矩阵行。

  1. 构造 X = F + I * G
  2. 变换得到 F(X)= F(F)+ I * F(G)
  3. 使用对称性提取 F(F) F(G) F (X)
  1. Construct x = f + i * g
  2. Transform to get F(x) = F(f) + i * F(g)
  3. Use symmetries to extract F(f) and F(g) from F(x)

我却无法仅仅输入的结果直接进入第二维FFT,因为在这种情况下我也不会改变整个矩阵,而是两个子矩阵代替。然而,转换之间提取数据意味着要么存储更多的数据( N / 2 + 1 需要EX preSS的一维FFT的实际输入的结果条目)或结合元素在指数 0 和指数 N / 2 到一个元素(使用同样的伎俩结合,因为这两个数字保证是实数),并使用相同的存储量,而且必须做出spcial情况下,在我的卷积

I can't however just input the results directly into the 2nd 1D FFT, because in that case I would not transform the whole matrix, but two submatrices instead. However extracting the data between the transformations means either storing more data (n/2+1 entries needed to express the result of an 1D FFT on real input) or combine the elements at index 0 and index n/2 into one element (combining using the same trick, since both numbers are guaranteed to be real) and use the same amount of storage but have to make a spcial case for that in my convolution.

由于我尝试使用更多的存储重用缓冲区尽可能(由于内存有限菱在GPU上的)是不是一个很好的解决方案。而且我的算法都没有配备工作matrixsizes这是不是16的2 /倍数功率(变化从内核到内核)。我宁愿避免引入特殊的情况下,不是,因为这些会使我的内核更复杂的伤害效率(我已经遇到了麻烦,尽量减少使用的每个内核寄存器计数)。

Since I try to reuse buffers as much as possible (due to limited RAM availible on the gpu) using more storage isn't a nice solution. Furthermore my algorithms are not equipped to work on matrixsizes which are not power of 2 / multiples of 16 (varies from kernel to kernel). I would rather avoid introducing special cases either, since those would make my kernels more complex hurting efficiency (I'm already having trouble to minimize the register count used by each kernel).

所以我的问题是,如果有一种优雅的方法这个问题,这意味着其中一个将工作没有任何使用更多的内存或特殊情况下,某些元素?

So my question is if there is an elegant approach to this problem, meaning one which will work without either using more memory or special cases for certain elements?

在理想情况下,我想能够做到整个FFT不分割我的组合数据的FFT的中间,但我不知道这就是可能的。

Ideally I would like to be able to do the whole FFT without splitting my combined data in the middle of the FFT, but I'm not sure thats possible.

推荐答案

嗯......我的两个引用:

Hmmm...my two references are:

http://www.engineeringproductivitytools.com/stuff/T0001/PT10.HTM http://images.apple.com/acg/pdf/FFTapps_20090909.pdf

http://www.engineeringproductivitytools.com/stuff/T0001/PT10.HTM http://images.apple.com/acg/pdf/FFTapps_20090909.pdf

我觉得承诺是埃尔米特数据结构,用0和N / 2挤进第一个元素的值是要走的路,因为前/反向和埃尔米特结构将制定出更好的。

I think that committing to a "hermitian" data structure, with the 0 and n/2 values packed into the first element is the way to go, as forward/inverse and hermitian structures will work out better.

这样的话,你有rUnWrap(FFT(n / 2个,甚至(x)的+ I *奇数(X)))= rFFT(x)的,并且riFFT可以在埃尔米特阵列上的工作,产生一对阵列偶奇,这又使原有的结构。

That way, you have rUnWrap(FFT(n/2, Even(x) + i*Odd(x)))= rFFT(x), and the riFFT can work on the "hermitian" array, producing the pair of arrays Even and Odd, which again gives the original structure.

还有其他的采样是可以做的,从而使原来的数组被分成 4 N / 2×N个/ 2阵列,带根(0,0),(0,1),(1,0),(1,1),然后包裹在末尾,使用最终基数4 通...也许这就是对于GPU内存更好的......我不知道。

There are also other samplings that can be done, whereby the the original array is broken into 4 n/2xn/2 arrays, rooted at (0,0),(0,1),(1,0),(1,1) and then wrapped up at the end, using a final radix-4 pass...perhaps that is better for the GPU memory...I don't know.

阿伦

这篇关于在实际输入数据高效的2D FFT?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆