Accelerating FFTW pruning to avoid massive zero padding
Question
Suppose that I have a sequence x(n) which is K * N long and that only the first N elements are different from zero. I'm assuming that N << K, say, for example, N = 10 and K = 100000. I want to calculate the FFT, by FFTW, of such a sequence. This is equivalent to having a sequence of length N and zero padding it to K * N. Since N and K may be "large", I have a significant zero padding. I'm exploring whether I can save some computation time by avoiding the explicit zero padding.
The case K = 2
Let us begin by considering the case K = 2. In this case, the DFT of x(n) can be written as

X(k) = sum_{n=0}^{N-1} x(n) * exp(-j * 2 * pi * k * n / (2 * N)),   k = 0, ..., 2 * N - 1.
If k is even, i.e., k = 2 * m, then

X(2 * m) = sum_{n=0}^{N-1} x(n) * exp(-j * 2 * pi * m * n / N),
which means that such values of the DFT can be calculated through an FFT of a sequence of length N, and not K * N.
If k is odd, i.e., k = 2 * m + 1, then

X(2 * m + 1) = sum_{n=0}^{N-1} [x(n) * exp(-j * pi * n / N)] * exp(-j * 2 * pi * m * n / N),
which means that such values of the DFT can again be calculated through an FFT of a sequence of length N, and not K * N.
So, in conclusion, I can exchange a single FFT of length 2 * N with 2 FFTs of length N.
The case of arbitrary K
In this case, we have

X(k) = sum_{n=0}^{N-1} x(n) * exp(-j * 2 * pi * k * n / (K * N)),   k = 0, ..., K * N - 1.

On writing k = m * K + t, with m = 0, ..., N - 1 and t = 0, ..., K - 1, we have

X(m * K + t) = sum_{n=0}^{N-1} [x(n) * exp(-j * 2 * pi * t * n / (K * N))] * exp(-j * 2 * pi * m * n / N).
So, in conclusion, I can exchange a single FFT of length K * N with K FFTs of length N. Since FFTW has fftw_plan_many_dft, I can expect some gain over the case of a single FFT.
To verify that, I have set up the following code
#include <stdio.h>
#include <stdlib.h>     /* srand, rand */
#include <string.h>     /* memcpy */
#include <time.h>       /* time */
#include <math.h>
#include <fftw3.h>

#include "TimingCPU.h"

#define PI_d 3.141592653589793

int main() {
    const int N = 10;
    const int K = 100000;

    fftw_plan plan_zp;

    fftw_complex *h_x                = (fftw_complex *)malloc(N     * sizeof(fftw_complex));
    fftw_complex *h_xzp              = (fftw_complex *)calloc(N * K, sizeof(fftw_complex));
    fftw_complex *h_xpruning         = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
    fftw_complex *h_xhatpruning      = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
    fftw_complex *h_xhatpruning_temp = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));
    fftw_complex *h_xhat             = (fftw_complex *)malloc(N * K * sizeof(fftw_complex));

    // --- Random number generation of the data sequence
    srand(time(NULL));
    for (int k = 0; k < N; k++) {
        h_x[k][0] = (double)rand() / (double)RAND_MAX;
        h_x[k][1] = (double)rand() / (double)RAND_MAX;
    }

    memcpy(h_xzp, h_x, N * sizeof(fftw_complex));

    plan_zp = fftw_plan_dft_1d(N * K, h_xzp, h_xhat, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_plan plan_pruning = fftw_plan_many_dft(1, &N, K, h_xpruning, NULL, 1, N,
                                                h_xhatpruning_temp, NULL, 1, N,
                                                FFTW_FORWARD, FFTW_ESTIMATE);

    TimingCPU timerCPU;

    timerCPU.StartCounter();
    fftw_execute(plan_zp);
    printf("Standard %f\n", timerCPU.GetCounter());

    // --- Step #1: multiply the input sequence by the twiddle factors
    timerCPU.StartCounter();
    double factor = -2. * PI_d / (K * N);
    for (int k = 0; k < K; k++) {
        double arg1 = factor * k;
        for (int n = 0; n < N; n++) {
            double arg    = arg1 * n;
            double cosarg = cos(arg);
            double sinarg = sin(arg);
            h_xpruning[k * N + n][0] = h_x[n][0] * cosarg - h_x[n][1] * sinarg;
            h_xpruning[k * N + n][1] = h_x[n][0] * sinarg + h_x[n][1] * cosarg;
        }
    }
    printf("Optimized first step %f\n", timerCPU.GetCounter());

    // --- Step #2: K FFTs of length N in one batched call
    timerCPU.StartCounter();
    fftw_execute(plan_pruning);
    printf("Optimized second step %f\n", timerCPU.GetCounter());

    // --- Step #3: reorganize the results
    timerCPU.StartCounter();
    for (int k = 0; k < K; k++) {
        for (int p = 0; p < N; p++) {
            h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
            h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
        }
    }
    printf("Optimized third step %f\n", timerCPU.GetCounter());

    double rmserror = 0., norm = 0.;
    for (int n = 0; n < N; n++) {
        rmserror += (h_xhatpruning[n][0] - h_xhat[n][0]) * (h_xhatpruning[n][0] - h_xhat[n][0])
                  + (h_xhatpruning[n][1] - h_xhat[n][1]) * (h_xhatpruning[n][1] - h_xhat[n][1]);
        norm     += h_xhat[n][0] * h_xhat[n][0] + h_xhat[n][1] * h_xhat[n][1];
    }
    printf("rmserror %f\n", 100. * sqrt(rmserror / norm));

    fftw_destroy_plan(plan_zp);
    fftw_destroy_plan(plan_pruning);
    free(h_x); free(h_xzp); free(h_xpruning);
    free(h_xhatpruning); free(h_xhatpruning_temp); free(h_xhat);

    return 0;
}
The approach I have developed consists of three steps:
- Multiplying the input sequence by "twiddle" complex exponentials;
- Performing the fftw_many;
- Reorganizing the results.
The fftw_many is faster than the single FFTW on K * N input points. However, steps #1 and #3 completely destroy that gain. I would expect steps #1 and #3 to be computationally much lighter than step #2.
My questions are:
- How is it possible that steps #1 and #3 are so much more computationally demanding than step #2?
- How can I improve steps #1 and #3 to achieve a net gain over the "standard" approach?
Thank you very much for any hint.
EDIT
I'm working with Visual Studio 2013 and compiling in Release mode.
Answer
For the third step you might want to try switching the order of the loops:
for (int p = 0; p < N; p++) {
    for (int k = 0; k < K; k++) {
        h_xhatpruning[p * K + k][0] = h_xhatpruning_temp[p + k * N][0];
        h_xhatpruning[p * K + k][1] = h_xhatpruning_temp[p + k * N][1];
    }
}
since it's generally more beneficial to have the store addresses be contiguous than the load addresses.
Either way you have a cache-unfriendly access pattern though. You could try working with blocks to improve this, e.g. assuming N is a multiple of 4:
for (int p = 0; p < N; p += 4) {
    for (int k = 0; k < K; k++) {
        for (int p0 = 0; p0 < 4; p0++) {
            h_xhatpruning[(p + p0) * K + k][0] = h_xhatpruning_temp[(p + p0) + k * N][0];
            h_xhatpruning[(p + p0) * K + k][1] = h_xhatpruning_temp[(p + p0) + k * N][1];
        }
    }
}
This should help to reduce the churn of cache lines somewhat. If it does then maybe also experiment with block sizes other than 4 to see if there is a "sweet spot".