Is it worth offloading FFT computation to an embedded GPU?


Question


We are considering porting an application from a dedicated digital signal processing chip to run on generic x86 hardware. The application does a lot of Fourier transforms, and from brief research, it appears that FFTs are fairly well suited to computation on a GPU rather than a CPU. For example, this page has some benchmarks with a Core 2 Quad and a GF 8800 GTX that show a 10-fold decrease in calculation time when using the GPU:

http://www.cv.nrao.edu/~pdemores/gpu/

However, in our product, size constraints restrict us to small form factors such as PC104 or Mini-ITX, and thus to rather limited embedded GPUs.

Is offloading computation to the GPU something that is only worth doing with meaty graphics cards on a proper PCIe bus, or would even embedded GPUs offer performance improvements?

Solution

Having developed FFT routines both on x86 hardware and GPUs (prior to CUDA, on 7800 GTX hardware), I found from my own results that with smaller FFT sizes (below 2^13) the CPU was faster. Above these sizes the GPU was faster. For instance, a 2^16-point FFT computed 2-4x more quickly on the GPU than the equivalent transform on the CPU. See the table of times below (all times are in seconds, comparing a 3GHz Pentium 4 vs. a 7800GTX; this work was done back in 2005, so it is old hardware and, as I said, non-CUDA. Newer libraries may show larger improvements):

N       FFTw (s)    GPUFFT (s)  GPUFFT MFLOPS   GPUFFT Speedup
8       0           0.00006     3.352705        0.006881
16      0.000001    0.000065    7.882117        0.010217
32      0.000001    0.000075    17.10887        0.014695
64      0.000002    0.000085    36.080118       0.026744
128     0.000004    0.000093    76.724324       0.040122
256     0.000007    0.000107    153.739856      0.066754
512     0.000015    0.000115    320.200892      0.134614
1024    0.000034    0.000125    657.735381      0.270512
2048    0.000076    0.000156    1155.151507     0.484331
4096    0.000173    0.000215    1834.212989     0.804558
8192    0.000483    0.00032     2664.042421     1.510011
16384   0.001363    0.000605    3035.4551       2.255411
32768   0.003168    0.00114     3450.455808     2.780041
65536   0.008694    0.002464    3404.628083     3.528726
131072  0.015363    0.005027    3545.850483     3.05604
262144  0.033223    0.012513    3016.885246     2.655183
524288  0.072918    0.025879    3079.443664     2.817667
1048576 0.173043    0.076537    2192.056517     2.260904
2097152 0.331553    0.157427    2238.01491      2.106081
4194304 0.801544    0.430518    1715.573229     1.861814
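As a sanity check on the table (my own back-calculation, not part of the original answer), the GPUFFT MFLOPS column appears consistent with an assumed operation count of 8·N·log2(N) per transform, and the speedup column is simply the FFTw time divided by the GPUFFT time:

```python
import math

def gpufft_mflops(n, gpu_seconds):
    # Operation count of 8*N*log2(N) per transform -- inferred from
    # the table's numbers, not stated in the original answer.
    return 8 * n * math.log2(n) / gpu_seconds / 1e6

def speedup(cpu_seconds, gpu_seconds):
    return cpu_seconds / gpu_seconds

# Spot-check a few of the larger rows (the small-N rows are dominated
# by timer rounding, so the back-calculation drifts there):
print(round(gpufft_mflops(16384, 0.000605), 1))    # ~3033, table: 3035.5
print(round(gpufft_mflops(4194304, 0.430518), 1))  # ~1714.7, table: 1715.6
print(round(speedup(0.001363, 0.000605), 3))       # ~2.253, table: 2.2554
```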

As suggested by other posters, the transfer of data to/from the GPU is the hit you take. Smaller FFTs can be performed on the CPU, with some implementations/sizes running entirely in cache. This makes the CPU the best choice for small FFTs (below ~1024 points). If, on the other hand, you need to perform large batches of work on data with minimal transfers to/from the GPU, then the GPU will beat the CPU hands down.
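That trade-off can be made concrete with a toy cost model. The numbers below are illustrative assumptions, not measurements: each GPU launch pays a fixed host-device transfer cost, so batching many transforms per transfer amortizes the overhead:

```python
import math

# Illustrative, made-up costs in seconds -- not measured figures:
TRANSFER_PER_LAUNCH = 1e-3  # bus/host copy overhead paid once per batch
GPU_PER_TRANSFORM = 5e-5    # GPU compute time per transform
CPU_PER_TRANSFORM = 2e-4    # CPU time for the same transform

def gpu_total(n_transforms, batch_size):
    launches = math.ceil(n_transforms / batch_size)
    return launches * TRANSFER_PER_LAUNCH + n_transforms * GPU_PER_TRANSFORM

def cpu_total(n_transforms):
    return n_transforms * CPU_PER_TRANSFORM

n = 1000
print(round(gpu_total(n, 1), 4))  # 1.05  -- one transfer per transform: GPU loses
print(round(cpu_total(n), 4))     # 0.2
print(round(gpu_total(n, n), 4))  # 0.051 -- one big batch: GPU wins
```

On an embedded GPU the compute advantage shrinks while the transfer cost stays, which pushes the break-even point toward even larger batches.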

I would suggest using FFTW if you want a fast FFT implementation, or the Intel Math Library if you want an even faster (commercial) implementation. For FFTW, creating plans with the FFTW_MEASURE flag will measure and test the fastest possible FFT routine for your specific hardware. I go into detail about this in this question.
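FFTW's planner works by timing candidate routines on your actual hardware and committing to the fastest one. A rough sketch of that measure-then-commit idea in Python (using numpy in place of FFTW's codelets; the radix-2 routine is my own illustration, not FFTW code):

```python
import time
import numpy as np

def fft_radix2(x):
    """Naive recursive Cooley-Tukey FFT for power-of-two lengths."""
    n = len(x)
    if n == 1:
        return x.astype(complex)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n) * odd
    return np.concatenate([even + twiddle, even - twiddle])

def plan(candidates, sample, repeats=5):
    """Time each candidate on a sample input and return the fastest,
    analogous in spirit to FFTW's FFTW_MEASURE planning."""
    best, best_t = None, float("inf")
    for fn in candidates:
        t0 = time.perf_counter()
        for _ in range(repeats):
            fn(sample)
        elapsed = time.perf_counter() - t0
        if elapsed < best_t:
            best, best_t = fn, elapsed
    return best

x = np.random.rand(1024)
fastest = plan([np.fft.fft, fft_radix2], x)
# Whichever routine wins, the transform it produces is the same:
assert np.allclose(fastest(x), np.fft.fft(x))
```

The payoff, as with FFTW, comes from planning once and then reusing the chosen routine for many transforms of the same size.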

For GPU implementations, you can't get better than the one provided by NVidia CUDA. The performance of GPUs has increased significantly since I did my experiments on a 7800GTX, so I would suggest giving their SDK a go for your specific requirements.
