为什么Windows上的FFTW比Linux上的快? [英] Why FFTW on Windows is faster than on Linux?

查看:111
本文介绍了为什么Windows上的FFTW比Linux上的快?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我使用fftw库(fftw3.afftw3.lib)在Linux和Windows中编写了两个相同的程序,并计算了fftwf_execute(m_wfpFFTplan)语句的持续时间(16-fft).

I wrote two identical programs in Linux and Windows using the fftw libraries (fftw3.a, fftw3.lib), and compute the duration of the fftwf_execute(m_wfpFFTplan) statement (16-fft).

对于10000次运行:

For 10000 runs:

  • 在Linux上:平均时间为0.9
  • 在Windows上:平均时间为0.12

我对为什么Windows上的速度比Linux上的速度快9倍感到困惑.

I am confused as to why this is nine times faster on Windows than on Linux.

处理器:Intel(R)Core(TM)i7 CPU 870 @ 2.93GHz

Processor: Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz

每个操作系统(Windows XP 32位和Linux OpenSUSE 11.4 32位)都安装在同一台计算机上.

Each OS (Windows XP 32 bit and Linux OpenSUSE 11.4 32 bit) are installed on same machines.

我从互联网上下载了fftw.lib(适用于Windows),但不知道该配置.使用此配置构建FFTW之后:

I downloaded the fftw.lib (for Windows) from internet and don't know that configurations. Once I build FFTW with this config:

/configure --enable-float  --enable-threads --with-combined-threads  --disable-fortran  --with-slow-timer  --enable-sse  --enable-sse2  --enable-avx   

在Linux中,它生成的库比默认配置(0.4毫秒)快四倍.

in Linux and it results in a lib that is four times faster than the default configs (0.4 ms).

推荐答案

16 FFT非常小.您会发现,FFT小于64,将是没有循环的硬编码汇编程序,以获得最高的性能.这意味着它们极易受到指令集,编译器优化甚至64位或32位字的变化的影响.

16 FFT is very small. What you will find is FFTs smaller than say 64 will be hard coded assembler with no loops to get the highest possible performance. This means they can be highly susceptible to variations in instruction sets, compiler optimisations, even 64 or 32bit words.

以2的幂进行FFT大小从16-> 1048576的测试时会发生什么?我说这是在Linux上作为特定的硬编码asm例程,可能不是对您的机器进行最佳优化的方法,但是对于这种特定大小的Windows实现,您可能很幸运.比较此范围内的所有大小,可以更好地说明Linux与Windows的性能.

What happens when you run a test of FFT sizes from 16 -> 1048576 in powers of 2? I say this as a particular hard-coded asm routine on Linux might not be the best optimized for your machine, whereas you might have been lucky on the Windows implementation for that particular size. A comparison of all sizes in this range will give you a better indication of the Linux vs. Windows performance.

您是否已校准FFTW?首次运行FFTW时,猜测每台机器的最快实现速度,但是,如果您有特殊的指令集,特定大小的高速缓存或其他处理器功能,则这些都会对执行速度产生巨大影响.结果,执行校准将测试各种FFT例程的速度,并为您的特定硬件选择每种尺寸最快的速度.校准涉及重复计算计划并保存生成的FFTW智慧"文件.然后可以重复使用保存的校准数据(这是一个漫长的过程).我建议您在软件启动时执行一次,然后每次重新使用该文件.校准后,我注意到某些尺寸的性能提高了4-10倍!

Have you calibrated FFTW? When first run FFTW guesses the fastest implementation per machine, however if you have special instruction sets, or a particular sized cache or other processor features then these can have a dramatic effect on execution speed. As a result performing a calibration will test the speed of various FFT routines and choose the fastest per size for your specific hardware. Calibration involves repeatedly computing the plans and saving the FFTW "Wisdom" file generated. The saved calibration data (this is a lengthy process) can then be re-used. I suggest doing it once when your software starts up and re-using the file each time. I have noticed 4-10x performance improvements for certain sizes after calibrating!

下面是我用来校准某些尺寸的FFTW的一小段代码.请注意,此代码是逐字逐句从我处理过的DSP库中粘贴的,因此某些函数调用特定于我的库.希望FFTW特定的调用对您有所帮助.

Below is a snippet of code I have used to calibrate FFTW for certain sizes. Please note this code is pasted verbatim from a DSP library I worked on so some function calls are specific to my library. I hope the FFTW specific calls are helpful.

// Calibration FFTW
void DSP::forceCalibration(void)
{
// Try to import FFTw Wisdom for fast plan creation
FILE *fftw_wisdom = fopen("DSPDLL.ftw", "r");

// If wisdom does not exist, ask user to calibrate
if (fftw_wisdom == 0)
{
    int iStatus2 = AfxMessageBox("FFTw not calibrated on this machine."\
        "Would you like to perform a one-time calibration?\n\n"\
        "Note:\tMay take 40 minutes (on P4 3GHz), but speeds all subsequent FFT-based filtering & convolution by up to 100%.\n"\
        "\tResults are saved to disk (DSPDLL.ftw) and need only be performed once per machine.\n\n"\
        "\tMAKE SURE YOU REALLY WANT TO DO THIS, THERE IS NO WAY TO CANCEL CALIBRATION PART-WAY!", 
        MB_YESNO | MB_ICONSTOP, 0);

    if (iStatus2 == IDYES)
    {
        // Perform calibration for all powers of 2 from 8 to 4194304
        // (most heavily used FFTs - for signal processing)
        AfxMessageBox("About to perform calibration.\n"\
            "Close all programs, turn off your screensaver and do not move the mouse in this time!\n"\
            "Note:\tThis program will appear to be unresponsive until the calibration ends.\n\n"
            "\tA MESSAGEBOX WILL BE SHOWN ONCE THE CALIBRATION IS COMPLETE.\n");
        startTimer();

        // Create a whole load of FFTw Plans (wisdom accumulates automatically)
        for (int i = 8; i <= 4194304; i *= 2)
        {
            // Create new buffers and fill
            DSP::cFFTin = new fftw_complex[i];
            DSP::cFFTout = new fftw_complex[i];
            DSP::fconv_FULL_Real_FFT_rdat = new double[i];
            DSP::fconv_FULL_Real_FFT_cdat = new fftw_complex[(i/2)+1];
            for(int j = 0; j < i; j++)
            {
                DSP::fconv_FULL_Real_FFT_rdat[j] = j;
                DSP::cFFTin[j][0] = j;
                DSP::cFFTin[j][1] = j;
                DSP::cFFTout[j][0] = 0.0;
                DSP::cFFTout[j][1] = 0.0;
            }

            // Create a plan for complex FFT.
            // Use the measure flag to get the best possible FFT for this size
            // FFTw "remembers" which FFTs were the fastest during this test. 
            // at the end of the test, the results are saved to disk and re-used
            // upon every initialisation of the DSP Library
            DSP::pCF = fftw_plan_dft_1d
                (i, DSP::cFFTin, DSP::cFFTout, FFTW_FORWARD, FFTW_MEASURE);

            // Destroy the plan
            fftw_destroy_plan(DSP::pCF);

            // Create a plan for real forward FFT
            DSP::pCF = fftw_plan_dft_r2c_1d
                (i, fconv_FULL_Real_FFT_rdat, fconv_FULL_Real_FFT_cdat, FFTW_MEASURE);

            // Destroy the plan
            fftw_destroy_plan(DSP::pCF);

            // Create a plan for real inverse FFT
            DSP::pCF = fftw_plan_dft_c2r_1d
                (i, fconv_FULL_Real_FFT_cdat, fconv_FULL_Real_FFT_rdat, FFTW_MEASURE);

            // Destroy the plan
            fftw_destroy_plan(DSP::pCF);

            // Destroy the buffers. Repeat for each size
            delete [] DSP::cFFTin;
            delete [] DSP::cFFTout;
            delete [] DSP::fconv_FULL_Real_FFT_rdat;
            delete [] DSP::fconv_FULL_Real_FFT_cdat;
        }

        double time = stopTimer();

        char * strOutput;
        strOutput = (char*) malloc (100);
        sprintf(strOutput, "DSP.DLL Calibration complete in %d minutes, %d seconds\n"\
            "Please keep a copy of the DSPDLL.ftw file in the root directory of your application\n"\
            "to avoid re-calibration in the future\n", (int)time/(int)60, (int)time%(int)60);
        AfxMessageBox(strOutput);

        isCalibrated = 1;

        // Save accumulated wisdom
        char * strWisdom = fftw_export_wisdom_to_string();  
        FILE *fftw_wisdomsave = fopen("DSPDLL.ftw", "w");
        fprintf(fftw_wisdomsave, "%s", strWisdom);

        fclose(fftw_wisdomsave);
        DSP::pCF = NULL;
        DSP::cFFTin = NULL;
        DSP::cFFTout = NULL;
        fconv_FULL_Real_FFT_cdat = NULL;
        fconv_FULL_Real_FFT_rdat = NULL;
        free(strOutput);
    }
}
else 
{
    // obtain file size.
    fseek (fftw_wisdom , 0 , SEEK_END);
    long lSize = ftell (fftw_wisdom);
    rewind (fftw_wisdom);

    // allocate memory to contain the whole file.
    char * strWisdom = (char*) malloc (lSize);

    // copy the file into the buffer.
    fread (strWisdom,1,lSize,fftw_wisdom);

    // import the buffer to fftw wisdom
    fftw_import_wisdom_from_string(strWisdom);

    fclose(fftw_wisdom);
    free(strWisdom);

    isCalibrated = 1;

    return;
}
}

秘诀在于使用FFTW_MEASURE标志创建计划,该标志专门测量数百个例程,以针对您的特定FFT类型(实数,复数,一维,二维和二维)和大小找到最快的程序:

The secret sauce is to create the plan using the FFTW_MEASURE flag, which specifically measures hundreds of routines to find the fastest for your particular type of FFT (real, complex, 1D, 2D) and size:

DSP::pCF = fftw_plan_dft_1d (i, DSP::cFFTin, DSP::cFFTout, 
   FFTW_FORWARD, FFTW_MEASURE);

最后,所有基准测试也应在execute之外的单个FFT计划阶段执行,该阶段从以释放模式编译且具有优化功能并与调试器分离的代码中调用.基准测试应在具有数千(甚至数百万)次迭代的循环中执行,然后使用平均运行时间来计算结果.您可能知道计划阶段会花费大量时间,并且执行被设计为使用一个计划多次执行.

Finally, all benchmark tests should also be performed with a single FFT Plan stage outside of execute, called from code that is compiled in release mode with optimizations on and detached from the debugger. Benchmarks should be performed in a loop with many thousands (or even millions) of iterations and then take the average run time to compute the result. As you probably know the planning stage takes a significant amount of time and the execute is designed to be performed multiple times with a single plan.

这篇关于为什么Windows上的FFTW比Linux上的快?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆