PyFFTW performance on multidimensional arrays

Question
I have an nD array, say of dimensions (144, 522720), and I need to compute its FFT. PyFFTW seems slower than numpy and scipy, which is not expected. Am I doing something wrong?

Below is my code:
import numpy
import scipy
import pyfftw
import time
n1 = 144
n2 = 522720
loops = 2
pyfftw.config.NUM_THREADS = 4
pyfftw.config.PLANNER_EFFORT = 'FFTW_ESTIMATE'
# pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'
Q_1 = pyfftw.empty_aligned([n1, n2], dtype='float64')
Q_2 = pyfftw.empty_aligned([n1, n2], dtype='complex_')
Q_ref = pyfftw.empty_aligned([n1, n2], dtype='complex_')
# repeat a few times to see if pyfft planner helps
for i in range(loops):
    Q_1 = numpy.random.rand(n1, n2)

    s1 = time.time()
    Q_ref = numpy.fft.fft(Q_1, axis=0)
    print('NUMPY - elapsed time: ', time.time() - s1, 's.')

    s1 = time.time()
    Q_2 = scipy.fft.fft(Q_1, axis=0)
    print('SCIPY - elapsed time: ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

    s1 = time.time()
    Q_2 = pyfftw.interfaces.numpy_fft.fft(Q_1, axis=0)
    print('PYFFTW NUMPY - elapsed time = ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

    s1 = time.time()
    Q_2 = pyfftw.interfaces.scipy_fftpack.fft(Q_1, axis=0)
    print('PYFFTW SCIPY - elapsed time = ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

    s1 = time.time()
    fft_object = pyfftw.builders.fft(Q_1, axis=0)
    Q_2 = fft_object()
    print('FFTW PURE - elapsed time = ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))
Answer
Firstly, if you turn on the cache before your main loop, the interfaces work largely as expected:
pyfftw.interfaces.cache.enable()
pyfftw.interfaces.cache.set_keepalive_time(30)
It's interesting that, despite the wisdom that should be stored, the construction of the pyfftw objects is still rather slow when the cache is off. No matter: this is exactly the purpose of the cache. In your case you need to make the cache keep-alive time quite long, because your loop takes a long time.
Secondly, it's not a fair comparison to include the construction time of the fft_object in the final test. If you move the construction outside the timer, then calling fft_object gives a better measure.
Thirdly, it's also interesting that even with the cache turned on, the call to numpy_fft is slower than the call to scipy_fft. Since there is no obvious difference in the code paths, I suggest it is a caching issue, which is the sort of issue that timeit seeks to mitigate. Here is my proposed timing code, which is more meaningful:
import numpy
import scipy
import pyfftw
import timeit
n1 = 144
n2 = 522720
pyfftw.config.NUM_THREADS = 4
pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'
Q_1 = pyfftw.empty_aligned([n1, n2], dtype='float64')
pyfftw.interfaces.cache.enable()
pyfftw.interfaces.cache.set_keepalive_time(30)
times = timeit.repeat(lambda: numpy.fft.fft(Q_1, axis=0), repeat=5, number=1)
print('NUMPY fastest time = ', min(times))
times = timeit.repeat(lambda: scipy.fft.fft(Q_1, axis=0), repeat=5, number=1)
print('SCIPY fastest time = ', min(times))
times = timeit.repeat(
    lambda: pyfftw.interfaces.numpy_fft.fft(Q_1, axis=0), repeat=5, number=1)
print('PYFFTW NUMPY fastest time = ', min(times))
times = timeit.repeat(
    lambda: pyfftw.interfaces.scipy_fftpack.fft(Q_1, axis=0), repeat=5, number=1)
print('PYFFTW SCIPY fastest time = ', min(times))
fft_object = pyfftw.builders.fft(Q_1, axis=0)
times = timeit.repeat(lambda: fft_object(Q_1), repeat=5, number=1)
print('FFTW PURE fastest time = ', min(times))
On my machine this gives output like:
NUMPY fastest time = 0.6622681759763509
SCIPY fastest time = 0.6572431400418282
PYFFTW NUMPY fastest time = 0.4003451430471614
PYFFTW SCIPY fastest time = 0.40362057799939066
FFTW PURE fastest time = 0.324020683998242
You can do a bit better if you don't force a copy of the input array into a complex data type, by changing Q_1 to be complex128:
NUMPY fastest time = 0.6483533839927986
SCIPY fastest time = 0.847397351055406
PYFFTW NUMPY fastest time = 0.3237176960101351
PYFFTW SCIPY fastest time = 0.3199474769644439
FFTW PURE fastest time = 0.2546963169006631
That interesting scipy slow-down is repeatable.
That said, if your input is real, you should be doing a real transform (for a >50% speed-up with pyfftw) and manipulating the resulting complex output.
What's interesting about this example is (I think) how important the cache is in the results (which I suggest is why switching to a real transform is so effective at speeding things up). You also see something dramatic when you change the array size to 524288 (the next power of two, which you might expect to speed things up, not slow them down dramatically). In that case everything slows down quite a bit, scipy particularly. It feels to me that scipy is more cache sensitive, which would explain the slow-down from changing the input to complex128 (522720 is quite a nice number for FFTing though, so perhaps we should expect a slow-down anyway).
Finally, if accuracy is secondary to speed, you can always use 32-bit floats as the data type. If you combine that with doing a real transform, you get better than a factor-of-10 speed-up over the initial numpy best given above:
PYFFTW NUMPY fastest time = 0.09026529802940786
PYFFTW SCIPY fastest time = 0.1701313250232488
FFTW PURE fastest time = 0.06202622700948268
(numpy and scipy don't change much, as I think they use 64-bit floats internally.)
I forgot that Scipy's fftpack real FFTs have a weird output structure, which pyfftw replicates with some slowdown. This has been changed to be more sensible in the new FFT module.
The new FFT interface is implemented in pyFFTW and should be preferred. Unfortunately there was a problem with the docs being rebuilt, so they were out of date for a long time and didn't show the new interface; hopefully that is fixed now.