FFTW与Matlab FFT [英] FFTW vs Matlab FFT

查看:159
本文介绍了FFTW与Matlab FFT的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在matlab Central上发布了此内容,但没有得到任何回应,因此我认为我将在此处重新发布.

I posted this on matlab central but didn't get any responses so I figured I'd repost here.

我最近在Matlab中编写了一个简单的例程,该例程在for循环中使用FFT; FFT支配了计算.我出于实验目的在mex中编写了相同的例程,该例程称为FFTW 3.3库.事实证明,对于非常大的阵列,matlab例程的运行速度比mex例程的运行速度快(大约快一倍).混合例程使用智慧并执行相同的FFT计算.我也知道matlab使用FFTW,但是它们的版本是否可能会稍微优化一些?我什至使用了FFTW_EXHAUSTIVE标志,对于大型数组,它的速度仍然是MATLAB对应标志的两倍.此外,我确保使用的matlab是带有"-singleCompThread"标志的单线程,并且使用的mex文件不在调试模式下.只是想知道是否是这种情况-还是在我不知道的情况下matlab正在使用某些优化.谢谢.

I recently wrote a simple routine in Matlab that uses an FFT in a for-loop; the FFT dominates the calculations. I wrote the same routine in mex just for experimentation purposes and it calls the FFTW 3.3 library. It turns out that the matlab routine runs faster than the mex routine for very large arrays (about twice as fast). The mex routine uses wisdom and and performs the same FFT calculations. I also know matlab uses FFTW, but is it possible their version is slightly more optimized? I even used the FFTW_EXHAUSTIVE flag and its still about twice as slow for large arrays than the MATLAB counterpart. Furthermore I ensured the matlab I used was single threaded with the "-singleCompThread" flag and the mex file I used was not in debug mode. Just curious if this was the case - or if there are some optimizations matlab is using under the hood that I dont know about. Thanks.

这是混合部分:

void class_cg_toeplitz::analysis() {
// This method computes CG iterations using FFTs
    // Check for wisdom
    if(fftw_import_wisdom_from_filename("cd.wis") == 0) {
        mexPrintf("wisdom not loaded.\n");
    } else {
        mexPrintf("wisdom loaded.\n");
    }

    // Set FFTW Plan - use interleaved FFTW
    fftw_plan plan_forward_d_buffer;    
    fftw_plan plan_forward_A_vec;       
    fftw_plan plan_backward_Ad_buffer;
    fftw_complex *A_vec_fft;
    fftw_complex *d_buffer_fft;
    A_vec_fft = fftw_alloc_complex(n);
    d_buffer_fft = fftw_alloc_complex(n);

    // CREATE MASTER PLAN - Do this on an empty vector as creating a plane 
    // with FFTW_MEASURE will erase the contents; 
    // Use d_buffer
    // This is somewhat dangerous because Ad_buffer is a vector; but it does not
    // get resized so &Ad_buffer[0] should work
    plan_forward_d_buffer = fftw_plan_dft_r2c_1d(d_buffer.size(),&d_buffer[0],d_buffer_fft,FFTW_EXHAUSTIVE);
    plan_forward_A_vec = fftw_plan_dft_r2c_1d(A_vec.height,A_vec.value,A_vec_fft,FFTW_WISDOM_ONLY);
    // A_vec_fft.*d_buffer_fft will overwrite d_buffer_fft
    plan_backward_Ad_buffer = fftw_plan_dft_c2r_1d(Ad_buffer.size(),d_buffer_fft,&Ad_buffer[0],FFTW_EXHAUSTIVE);

    // Get A_vec_fft
    fftw_execute(plan_forward_A_vec);

    // Find initial direction - this is the initial residual
    for (int i=0;i<n;i++) {
        d_buffer[i] = b.value[i];
        r_buffer[i] = b.value[i];
    }    

    // Start CG iterations
    norm_ro = norm(r_buffer);
    double fft_reduction = (double)Ad_buffer.size(); // Must divide by size of vector because inverse FFT does not do this
    while (norm(r_buffer)/norm_ro > relativeresidual_cutoff) {        
        // Find Ad - use fft
        fftw_execute(plan_forward_d_buffer);    
        // Get A_vec_fft.*fft(d) - A_vec_fft is only real, but d_buffer_fft
        // has complex elements; Overwrite d_buffer_fft        
        for (int i=0;i<n;i++) {
            d_buffer_fft[i][0] = d_buffer_fft[i][0]*A_vec_fft[i][0]/fft_reduction;
            d_buffer_fft[i][1] = d_buffer_fft[i][1]*A_vec_fft[i][0]/fft_reduction;
        }        
        fftw_execute(plan_backward_Ad_buffer); 

        // Calculate r'*r
        rtr_buffer = 0;
        for (int i=0;i<n;i++) {
            rtr_buffer = rtr_buffer + r_buffer[i]*r_buffer[i];
        }    

        // Calculate alpha
        alpha = 0;
        for (int i=0;i<n;i++) {
            alpha = alpha + d_buffer[i]*Ad_buffer[i];
        }    
        alpha = rtr_buffer/alpha;

        // Calculate new x
        for (int i=0;i<n;i++) {
            x[i] = x[i] + alpha*d_buffer[i];
        }   

        // Calculate new residual
        for (int i=0;i<n;i++) {
            r_buffer[i] = r_buffer[i] - alpha*Ad_buffer[i];
        }   

        // Calculate beta
        beta = 0;
        for (int i=0;i<n;i++) {
            beta = beta + r_buffer[i]*r_buffer[i];
        }  
        beta = beta/rtr_buffer;

        // Calculate new direction vector
        for (int i=0;i<n;i++) {
            d_buffer[i] = r_buffer[i] + beta*d_buffer[i];
        }  

        *total_counter = *total_counter+1;
        if(*total_counter >= iteration_cutoff) {
            // Set total_counter to -1, this indicates failure
            *total_counter = -1;
            break;
        }
    }

    // Store Wisdom
    fftw_export_wisdom_to_filename("cd.wis");

    // Free fft alloc'd memory and plans
    fftw_destroy_plan(plan_forward_d_buffer);
    fftw_destroy_plan(plan_forward_A_vec);
    fftw_destroy_plan(plan_backward_Ad_buffer);
    fftw_free(A_vec_fft);
    fftw_free(d_buffer_fft);
};

这是matlab部分:

Here's the matlab portion:

% Take FFT of A_vec.
A_vec_fft = fft(A_vec); % Take fft once

% Find initial direction - this is the initial residual 
x = zeros(n,1); % search direction
r = zeros(n,1); % residual
d = zeros(n+(n-2),1); % search direction; pad to allow FFT
for i = 1:n
    d(i) = b(i); 
    r(i) = b(i); 
end

% Enter CG iterations
total_counter = 0;
rtr_buffer = 0;
alpha = 0;
beta = 0;
Ad_buffer = zeros(n+(n-2),1); % This holds the product of A*d - calculate this once per iteration and using FFT; only 1:n is used
norm_ro = norm(r);

while(norm(r)/norm_ro > 10^-6)
    % Find Ad - use fft
    Ad_buffer = ifft(A_vec_fft.*fft(d)); 

    % Calculate rtr_buffer
    rtr_buffer = r'*r;

    % Calculate alpha    
    alpha = rtr_buffer/(d(1:n)'*Ad_buffer(1:n));

    % Calculate new x
    x = x + alpha*d(1:n);

    % Calculate new residual
    r = r - alpha*Ad_buffer(1:n);

    % Calculate beta
    beta = r'*r/(rtr_buffer);

    % Calculate new direction vector
    d(1:n) = r + beta*d(1:n);      

    % Update counter
    total_counter = total_counter+1; 
end

就时间而言,对于N = 50000和b = 1:n,使用mex大约需要10.5秒,而使用matlab大约需要4.4秒.我正在使用R2011b.谢谢

In terms of time, for N = 50000 and b = 1:n it takes about 10.5 seconds with mex and 4.4 seconds with matlab. I'm using R2011b. Thanks

推荐答案

一些观察结果而不是一个确定的答案,因为我不知道MATLAB FFT实现的任何细节:

A few observations rather than a definite answer since I do not know any of the specifics of the MATLAB FFT implementation:

  • 根据您拥有的代码,我可以看到两种有关速度差异的解释:
    • 速度差异是通过FFT优化级别的差异来解释的
    • 在MATLAB中执行while循环的次数要少得多
    • Based on the code you have, I can see two explanations for the speed difference:
      • the speed difference is explained by differences in levels of optimization of the FFT
      • the while loop in MATLAB is executed a significantly smaller number of times

      我假设您已经研究了第二个问题,并且迭代次数是可比较的. (如果不是这样,则很可能会出现一些准确性问题,值得进一步研究.)

      I will assume you already looked into the second issue and that the number of iterations are comparable. (If they aren't, this is most likely to some accuracy issues and worth further investigations.)

      现在,关于FFT速度比较:

      Now, regarding FFT speed comparison:

      • 是的,理论上讲FFTW比其他高级FFT实施要快,但是只有在您将Apple与Apple比较时,它才有意义:在这里,您在组装级别,在更低的层次上比较实现.不仅要选择算法,还要针对特定​​处理器以及具有不同技能的软件开发人员进行实际优化
      • 在过去的一年中,我已经在许多处理器上(在基准测试行业中)对组合中的FFT进行了优化或审查,而出色的算法只是其中的一部分.有一些与您要编码的体系结构非常相关的注意事项(考虑等待时间,指令调度,寄存器使用的优化,内存中数据的排列,考虑分支采用/未采用等待时间等)并且有所不同.与选择算法一样重要.
      • 在N = 500000的情况下,我们还讨论的是大内存缓冲区:又一扇门,可以进行更多优化,这些优化可以快速针对您在其上运行代码的平台:您将如何有效地避免缓存未命中该算法所决定的不只是软件开发人员可能用来有效地将数据引入和带出内存的数据流方式和优化方式.
      • 尽管我不知道MATLAB FFT实现的细节,但我可以肯定,DSP工程师团队一直(并且仍在)对其优化进行磨练,因为它是众多设计的关键.这很可能意味着MATLAB拥有正确的开发人员组合,可以产生更快的FFT.

      这篇关于FFTW与Matlab FFT的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆