PyFFTW performance on multidimensional arrays

Question

I have an nD array, say of dimensions (144, 522720), and I need to compute its FFT.

PyFFTW seems slower than numpy and scipy, which is not what I expected.

Am I doing something wrong?

Below is my code:

import numpy
import scipy.fft
import pyfftw
import time

n1 = 144
n2 = 522720
loops = 2

pyfftw.config.NUM_THREADS = 4
pyfftw.config.PLANNER_EFFORT = 'FFTW_ESTIMATE'
# pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'

Q_1 = pyfftw.empty_aligned([n1, n2], dtype='float64')
Q_2 = pyfftw.empty_aligned([n1, n2], dtype='complex128')
Q_ref = pyfftw.empty_aligned([n1, n2], dtype='complex128')

# repeat a few times to see if pyfft planner helps
for i in range(0,loops):
    Q_1 = numpy.random.rand(n1,n2)

    s1 = time.time()
    Q_ref = numpy.fft.fft(Q_1, axis=0)
    print('NUMPY - elapsed time: ', time.time() - s1, 's.')

    s1 = time.time()
    Q_2 = scipy.fft.fft(Q_1, axis=0)
    print('SCIPY - elapsed time: ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

    s1 = time.time()
    Q_2 = pyfftw.interfaces.numpy_fft.fft(Q_1, axis=0)
    print('PYFFTW NUMPY - elapsed time = ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

    s1 = time.time()
    Q_2 = pyfftw.interfaces.scipy_fftpack.fft(Q_1, axis=0)
    print('PYFFTW SCIPY - elapsed time = ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

    s1 = time.time()
    fft_object = pyfftw.builders.fft(Q_1, axis=0)
    Q_2 = fft_object()
    print('FFTW PURE Elapsed time = ', time.time() - s1, 's')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

Recommended answer

Firstly, if you turn on the cache before your main loop, the interfaces work largely as expected:

pyfftw.interfaces.cache.enable()
pyfftw.interfaces.cache.set_keepalive_time(30)

It's interesting that, even though wisdom should be stored, constructing the pyfftw objects is still rather slow when the cache is off. No matter: this is exactly the purpose of the cache. In your case you need to make the cache keep-alive time quite long because each pass through your loop takes a long time.

Secondly, it's not a fair comparison to include the construction time of the fft_object in the final test. If you move the construction outside the timer, then calling fft_object is a better measure.

Thirdly, it's also interesting to see that even with the cache turned on, the call to numpy_fft is slower than the call to scipy_fft. Since there is no obvious difference in the code path, I suggest that is a caching issue. This is the sort of issue that timeit seeks to mitigate. Here's my proposed timing code, which is more meaningful:

import numpy
import scipy.fft
import pyfftw
import timeit

n1 = 144
n2 = 522720

pyfftw.config.NUM_THREADS = 4
pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'

Q_1 = pyfftw.empty_aligned([n1, n2], dtype='float64')

pyfftw.interfaces.cache.enable()
pyfftw.interfaces.cache.set_keepalive_time(30)

times = timeit.repeat(lambda: numpy.fft.fft(Q_1, axis=0), repeat=5, number=1)
print('NUMPY fastest time = ', min(times))

times = timeit.repeat(lambda: scipy.fft.fft(Q_1, axis=0), repeat=5, number=1)
print('SCIPY fastest time = ', min(times))

times = timeit.repeat(
    lambda: pyfftw.interfaces.numpy_fft.fft(Q_1, axis=0), repeat=5, number=1)
print('PYFFTW NUMPY fastest time = ', min(times))

times = timeit.repeat(
    lambda: pyfftw.interfaces.scipy_fftpack.fft(Q_1, axis=0), repeat=5, number=1)
print('PYFFTW SCIPY fastest time = ', min(times))

fft_object = pyfftw.builders.fft(Q_1, axis=0)
times = timeit.repeat(lambda: fft_object(Q_1), repeat=5, number=1)
print('FFTW PURE fastest time = ', min(times))

On my machine this gives an output like:

NUMPY fastest time =  0.6622681759763509
SCIPY fastest time =  0.6572431400418282
PYFFTW NUMPY fastest time =  0.4003451430471614
PYFFTW SCIPY fastest time =  0.40362057799939066
FFTW PURE fastest time =  0.324020683998242

You can do a bit better if you avoid forcing a copy of the input array into a complex data type, by changing Q_1 to be complex128:

NUMPY fastest time =  0.6483533839927986
SCIPY fastest time =  0.847397351055406
PYFFTW NUMPY fastest time =  0.3237176960101351
PYFFTW SCIPY fastest time =  0.3199474769644439
FFTW PURE fastest time =  0.2546963169006631

That interesting scipy slowdown is repeatable.

That said, if your input is real, you should be doing a real transform (for a >50% speed-up with pyfftw) and manipulating the resultant complex output.
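To make "manipulating the resultant complex output" concrete, here is a minimal sketch using plain numpy. It shows that a real transform returns only the non-negative frequency bins (n//2 + 1 of them along the transformed axis) and that the full spectrum can be rebuilt from Hermitian symmetry if it is really needed. The helper `full_spectrum_from_rfft` is a made-up name for this example; pyfftw.interfaces.numpy_fft.rfft follows the same calling convention as numpy.fft.rfft.

```python
import numpy as np


def full_spectrum_from_rfft(half, n, axis=0):
    """Rebuild the length-n complex FFT from rfft output (hypothetical helper)."""
    half = np.moveaxis(half, axis, 0)
    # Bins above n//2 are the complex conjugates of the lower bins, in reverse
    # order. Bin 0 (and the Nyquist bin, for even n) is not repeated.
    n_missing = n - half.shape[0]
    upper = np.conj(half[1:1 + n_missing][::-1])
    full = np.concatenate([half, upper], axis=0)
    return np.moveaxis(full, 0, axis)


rng = np.random.default_rng(0)
x = rng.random((144, 64))  # real input; 64 columns instead of 522720

half = np.fft.rfft(x, axis=0)  # real transform: 73 rows instead of 144
full = np.fft.fft(x, axis=0)   # complex transform of the same data
rebuilt = full_spectrum_from_rfft(half, x.shape[0], axis=0)

print(half.shape, full.shape, np.allclose(rebuilt, full))
```

In many cases the rebuild step can be skipped, since the upper half of the spectrum carries no extra information for real input.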

What's interesting about this example is (I think) how important the CPU cache is in the results (which I suggest is why switching to a real transform is so effective in speeding things up). You also see something dramatic when you change the array size to 524288 (the next power of two, which you might expect to speed things up rather than slow them down dramatically). In this case everything slows down quite a bit, scipy particularly. It feels to me that scipy is more cache sensitive, which would explain the slowdown when changing the input to complex128 (522720 is quite a nice number for FFTing though, so perhaps we should expect a slowdown).
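One way to see why 522720 counts as a "nice" size is to factor it: FFT libraries are generally most efficient when a length is a product of small primes. The sketch below uses a simple trial-division helper (`prime_factors`, a made-up name) on the three sizes discussed here. Note that the timing code transforms along axis 0, so the transform length is 144, and the second dimension sets how many transforms are performed side by side.

```python
def prime_factors(n):
    """Return {prime: exponent} for n using trial division (hypothetical helper)."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        # Whatever remains is a single prime factor.
        factors[n] = factors.get(n, 0) + 1
    return factors


for size in (144, 522720, 524288):
    print(size, prime_factors(size))
# 144 {2: 4, 3: 2}
# 522720 {2: 5, 3: 3, 5: 1, 11: 2}
# 524288 {2: 19}
```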

Finally, if accuracy is secondary to speed, you can always use 32-bit floats as the data type. If you combine that with doing a real transform, you get a better than factor of 10 speed-up over the initial numpy best given above:

PYFFTW NUMPY fastest time =  0.09026529802940786
PYFFTW SCIPY fastest time =  0.1701313250232488
FFTW PURE fastest time =  0.06202622700948268

(numpy and scipy don't change much, as I think they use 64-bit floats internally.)

I forgot that SciPy's fftpack real FFTs have a weird output structure, which pyfftw replicates with some slowdown. This has been changed to be more sensible in the new FFT module.

The new FFT interface is implemented in pyFFTW and should be preferred. Unfortunately there was a problem with the docs being rebuilt, so they were out of date for a long time and didn't show the new interface. Hopefully that is fixed now.
