running FFTW on GPU vs using CUFFT


Question



I have a basic C++ FFTW implementation that looks like this:

for (int i = 0; i < N; i++){
     // declare pointers and plan
     fftw_complex *in, *out;
     fftw_plan p;

     // allocate 
     in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
     out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);

     // initialize "in"
     ...

     // create plan
     p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

     // execute plan
     fftw_execute(p);

     // clean up
     fftw_destroy_plan(p);
     fftw_free(in); fftw_free(out);
}

I'm doing N fft's in a for loop. I know I can execute many plans at once with FFTW, but in my implementation in and out are different every loop. The point is I'm doing the entire FFTW pipeline INSIDE a for loop.

I want to transition to using CUDA to speed this up. I understand that CUDA has its own FFT library CUFFT. The syntax is very similar: From their online documentation:

#define NX 64
#define NY 64
#define NZ 128

cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);

/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);

/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);

/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);

However, each of these "kernels" (as Nvidia calls them) (cufftPlan3d, cufftExecC2C, etc.) is a call to and from the GPU. If I understand the CUDA structure correctly, each of these method calls is an INDIVIDUALLY parallelized operation:

#define NX 64
#define NY 64
#define NZ 128

cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU

/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU

/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU

/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);

I understand how this can speed up my code by running each FFT step on a GPU. But, what if I want to parallelize my entire for loop? What if I want each of my original N for loops to run the entire FFTW pipeline on the GPU? Can I create a custom "kernel" and call FFTW methods from the device (GPU)?

Solution

You cannot call FFTW methods from device code. The FFTW libraries are compiled x86 code and will not run on the GPU.

If the "heavy lifting" in your code is in the FFT operations, and the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine. Once the machine is fully utilized, there is generally no additional benefit to trying to run more things in parallel.

cufft routines can be called by multiple host threads, so it is possible to make multiple calls into cufft for multiple independent transforms. It's unlikely you would see much speedup from this if the individual transforms are large enough to utilize the machine.
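As an illustrative sketch (not part of the original answer), independent transforms can also be overlapped from a single host thread by giving each plan its own CUDA stream with cufftSetStream; the variable names mirror the documentation example above:

```c
#include <cufft.h>
#include <cuda_runtime.h>

#define NX 64
#define NY 64
#define NZ 128

int main(void)
{
    cufftComplex *data1, *data2;
    cudaMalloc((void **)&data1, sizeof(cufftComplex) * NX * NY * NZ);
    cudaMalloc((void **)&data2, sizeof(cufftComplex) * NX * NY * NZ);

    /* One stream per independent transform. */
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cufftHandle plan1, plan2;
    cufftPlan3d(&plan1, NX, NY, NZ, CUFFT_C2C);
    cufftPlan3d(&plan2, NX, NY, NZ, CUFFT_C2C);
    cufftSetStream(plan1, s1);   /* plan1's work is queued on s1 */
    cufftSetStream(plan2, s2);   /* plan2's work is queued on s2 */

    /* Both execs return immediately; the two transforms may overlap
       on the GPU -- but in practice only if a single transform does
       not already saturate the machine. */
    cufftExecC2C(plan1, data1, data1, CUFFT_FORWARD);
    cufftExecC2C(plan2, data2, data2, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cufftDestroy(plan1); cufftDestroy(plan2);
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(data1); cudaFree(data2);
    return 0;
}
```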

cufft also supports batched plans, which are another way to execute multiple transforms "at once".
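A batched plan maps directly onto the questioner's loop of N independent 1D FFTs. The sketch below (assuming all transforms have the same length, as the original loop implies; the names N, BATCH, and data are illustrative) replaces the whole loop with one cufftPlanMany call and one exec:

```c
#include <cufft.h>
#include <cuda_runtime.h>

#define N     1024   /* length of each 1D transform (illustrative) */
#define BATCH 100    /* number of independent transforms (illustrative) */

int main(void)
{
    /* All BATCH input signals stored back to back in one device array. */
    cufftComplex *data;
    cudaMalloc((void **)&data, sizeof(cufftComplex) * N * BATCH);

    /* ... copy the BATCH input signals into 'data', contiguously ... */

    cufftHandle plan;
    int n[1] = { N };
    /* rank 1, default (NULL) embeds and unit strides, distance N
       between consecutive transforms, BATCH transforms in total. */
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, N,   /* input layout  */
                  NULL, 1, N,   /* output layout */
                  CUFFT_C2C, BATCH);

    /* Execute all BATCH forward FFTs in a single in-place call. */
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```

Compared with calling cufftExecC2C in a host-side loop, the batched plan lets the library schedule all the transforms together, which is usually the better way to keep the GPU busy when each individual FFT is small.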
