这是所有平台上的最佳FIR滤波器吗? [英] Is this the optimal FIR filter on all platforms?

查看:64
本文介绍了这是所有平台上的最佳FIR滤波器吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述




也许有人可以帮助我优化FIR / $
过滤器的这个C / C ++实现(虽然我意识到我应该考虑一种FFT方法

代替。)


下面的例子计算了10000

过滤器10000输出的输出信号水龙头。在这个例子中,输入信号和滤波器系数只是
垃圾。对于我对此过滤器的预期用法,存储过滤器状态不需要



我的问题是这个实现在我的

笔记本电脑上的Cygwin,但在Linux中不太好,在SunOS中更差。我试图用
来计算示例程序在

不同平台上消耗的大周期数(执行时间以秒为单位乘以CPU时钟

频率(以MHz为单位):


Cygwin:448个Mcycles(216个Mcycles没有* tmp_output_ptr ++ = sum;)

Linux:1090个Mcycles(没有151个Mcycles) * tmp_output_ptr ++ = sum;)

SunOS:2800个Mcycles(103个Mcycles没有* tmp_output_ptr ++ = sum;)


括号中的性能是在某一行获得的是

注释掉了(即行* tmp_output_ptr ++ = sum;)。正如你所看到的那样

在不同平台上的行为非常不同!


SunOS程序在没有批评的情况下表现非常出色。注意

,103 Mcycles意味着每次迭代只花费一个周期在内部for循环中!
但是当它的关键线包含在b
时,它的表现非常糟糕,大约慢了28倍!


作为对比,Cygwin程序消耗216个Mcycles,即大约两个

内循环中每次迭代的时钟周期,没有关键线和

只有关键线的两倍左右。


有人可以帮我加速Linux和SunOS

平台上的程序吗?


问候,

Johan

#include< malloc.h>

int main()

{

const int nrof_lags = 10000;

const int nrof_taps = 10000;

const int * const coeff_ptr =

(const int * const)malloc(nrof_taps * sizeof(int) );

const int * const input_ptr =

(const int * const)malloc((nrof_taps-1 + nrof_lags)* sizeof(int));

int const * output_ptr =(int const *)malloc(nrof_lags * sizeof(int));

const int * tmp_coeff_ptr;

const int * tm p_input_ptr;

int * tmp_output_ptr;

int sum;

int lag,tap;

tmp_output_ptr =(int *)output_ptr;

for(lag = 0;滞后< nrof_lags;滞后++)

{

tmp_coeff_ptr =(const int *)coeff_ptr;

tmp_input_ptr =(const int *)input_ptr + lag;

sum = 0;

for(tap = 0; tap< nrof_taps; tap ++)

{

sum + = * tmp_coeff_ptr ++ * * tmp_input_ptr ++;

}

* tmp_output_ptr ++ = sum;

}

}

解决方案

Johan Bergman写道:



也许有人可以帮我优化这个F / C ++实现了FIR
滤波器(虽然我意识到我应该考虑使用FFT方法。)

下面的例子计算10000滞后的输出信号具有10000个抽头的FIR
过滤器。在这个例子中,输入信号和滤波器系数只是垃圾。对于我对这个过滤器的预期用法,没有必要存储过滤器的状态。

我的问题是这个实现在我的笔记本电脑上的Cygwin中表现很好,但在Linux中不太好,在SunOS中甚至更糟。我试图计算示例程序在不同平台上消耗的大循环次数(以秒为单位的执行时间乘以CPU时钟频率,以MHz为单位):

Cygwin:448个Mcycles(216个Mcycles没有* tmp_output_ptr ++ = sum;)
Linux:1090个Mcycles(151个Mcycles没有* tmp_output_ptr ++ = sum;)
SunOS:2800个Mcycles(103个Mcycles没有* tmp_output_ptr ++ = sum;)

括号中的性能是在某一行被注释掉时获得的(即行* tmp_output_ptr ++ = sum;)。正如你所看到的那样,在不同的平台上有不同的行为!

SunOS程序在没有批评的情况下表现非常好。注意
103 Mcycles意味着内部for循环每次迭代只花费一个周期!但是当包含关键线时,它的表现非常糟糕,大约慢28倍!

作为比较,Cygwin程序消耗216个Mcycles,即每次迭代大约两个时钟周期在内循环中,没有关键线,只有关键线的两倍左右。

有人可以帮我加速Linux和SunOS上的程序
平台?

问候,
约翰

#include< malloc.h>
int main()
{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =
(const int * const)malloc(nrof_taps * sizeof(int));
const int * const input_ptr =
(const int * const)malloc((nrof_taps-1 + nrof_lags)* sizeof(int));
int const * output_ptr =(int const *)malloc(nrof_lags * sizeof(int));
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag ,点击;
tmp_output_ptr =(int *)output_ptr;
for(lag = 0;滞后< nrof_lags;滞后++)
{
tmp_coeff_ptr =(const int *)coeff_ptr;
tmp_input_ptr =(const int *)input_ptr + lag;
sum = 0;
for(tap = 0;点击< nrof_taps;点击++)
{
sum + = * tmp_coeff_ptr ++ * * tmp_input_ptr ++;
}
* tmp_output_ptr ++ = sum;
}
}



假设你在每种情况下都使用基本相同的编译器和相同的

优化选项,你必须运行不同的

硬件。您没有理由相信操作系统是负责任的。

如果您的一个或多个平台支持SSE / SSE2指令,那么您可以更改操作数
浮动/加倍,你应该能够获得更多的性能。将总和分成4或8个部分总和,因此

并行减少总和。如果你使用C编译器这就变成OT了

只支持asm中的并行指令。

-

Tim Prince




[警告:未来的长式评论]


2003年10月25日星期六,Johan Bergman写道:< blockquote class =post_quotes>
也许有人可以帮我优化FIR
过滤器的这个C / C ++实现(虽然我意识到我应该考虑使用FFT方法。)


世界上哪个FIR滤波器? (有限的冲动反应,是的,

是的,我知道,我查了一下。点是,几乎没有人在这个

新闻组中也会知道,他们' '如果他们真的想帮助你,那么*所有*必须查阅

。你正在让人们更加努力

来帮助你不定义你的条款。)

我的问题是这个实现在我的笔记本电脑上的Cygwin表现得很好,但在Linux中不太好,在SunOS中更差。我试图计算示例程序在不同平台上消耗的大周期数(以秒为单位的执行时间乘以CPU时钟的频率,以MHz为单位):

Cygwin:448个Mcycles(216个Mcycles没有* tmp_output_ptr ++ = sum;)
Linux:1090个Mcycles(151个Mcycles没有* tmp_output_ptr ++ = sum;)
SunOS:2800个Mcycles(103个Mcycles没有* tmp_output_ptr ++ = sum;)
有人可以帮我加速Linux和SunOS平台上的程序吗?


好​​吧,首先让我们清理你的代码然后看看它是什么样的

然后。清洁代码总是更容易操作。 :-)


#include< malloc.h>


非标头。 ''malloc''在< stdlib.h>中声明。


#include< stdlib.h>

int main()


[有些常客强烈相信''int main(void)''。不管怎样,我都不会有强烈的偏好,但是考虑你为什么选择你做过的风格是明智的,这是明智的



像这样的小东西。]

{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =


好​​的,是的,''const''很棒。但是真的,这太荒谬了。

你不觉得所有这些争论都会让你的

计划更有效率,对吗? (他们没有伤害,但是,b $ b过度消费很糟糕,主要是因为它很难看,丑陋的代码

很难做对。) />
(const int * const)malloc(nrof_taps * sizeof(int));


malloc(nrof_taps * sizeof * coeff_ptr);


使用这个成语时没有机会出错。

另外请注意,将malloc()结果转换为''const''任何东西

只是愚蠢,因为free()采用非常量指针

参数。你打算如何释放()这个记忆,嗯?


const int * const input_ptr =
(const int * const)malloc((nrof_taps-1 + nrof_lags) * sizeof(int));
int const * output_ptr =(int const *)malloc(nrof_lags * sizeof(int));
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag,tap;


在我看来,有太多同名的变量。

再次,纯粹是一个风格问题,但同样,干净的代码是

很好的代码。请注意,在清理版本中,我已将其范围内的
提取到实际使用的位置。这可能实际上有助于提高程序的效率,因为一些

优化器会寻找windows。其中的对象可以移动到寄存器中等等。它们的范围越广,优化器就越难以看到它们。

tmp_output_ptr =(int *)output_ptr;


抛弃常量是坏的。特别是当你刚刚完成跳投,让它成为第一个

的地方时。 Geez。

for(lag = 0; lag< nrof_lags; lag ++)
{
tmp_coeff_ptr =(const int *)coeff_ptr;
tmp_input_ptr =(const int * )input_ptr + lag;


停止执行随机的游戏!为了Pete的缘故!

在这里抓住你自己!

sum = 0;
for(tap = 0;点击< nrof_taps;点击++)
{sum /> sum + = * tmp_coeff_ptr ++ * * tmp_input_ptr ++;


这一行(以及下面的一行)只是简单地要求简化为b $ b。让我们扩展它,这样我们就可以看到订单

的操作。

}
* tmp_output_ptr ++ = sum;
}


/ *总是* /返回0; / *来自主* /

}




好​​吧,让我们把它们放在一起:没有演员,干净的名字,

清理mallocs,看看我们得到了什么!


你们看到所有那些tmp _ * _ ptr对象如何消失

一旦我们开始清楚地看到程序的大纲。

实际上,我们发现内存分配中的一个微妙的错误,

并修复它(''input''是malloc ''ed one element short)!

最后,我们努力寻找惯用标识符:使用''i''和

''j''用于嵌套循环控件,并且在主数据阵列上丢失了愚蠢的

''_ ptr''后缀。

#include< stdlib.h>


int main(void)

{

int nrof_lags = 10000; / *这两个都可以#defines吗? * /

int nrof_taps = 10000; / *这会有所帮助。 * /


int * coeffs = malloc(nrof_taps * sizeof * coeffs);

int * input = malloc((nrof_taps + nrof_lags)* sizeof * input );

int * output = malloc(nrof_lags * sizeof * output);

int i,j;


for( i = 0; i< nrof_lags; ++ i)

{

int sum = 0;

for(j = 0; j < nrof_taps; ++ j)

sum + = coeffs [j] * input [i + j];

output [i] = sum;

}


免费(coeffs);

免费(输入);

免费(输出);

返回0;

}

有 - 这不是更简单,更容易
$ b $面包?我敢打赌它在你所有的机器上表现得相当快。

如果没有,你可以尝试将''int''改为''unsigned int''(

,事实上,无论如何,你可能会这样做,以防止溢出

条件);颠倒任何一个或两个循环的顺序;

如果''coeffs [j]''中的许多都为零,请尝试预处理

''coffs''a a一点;这样的小调整。


我认为你的大部分性能差异是由于不同机器上不同的缓存布局造成的,但我是

很容易出错。


代码现在怎么做?


HTH,

-Arthur


" Johan Bergman" <我们** @ inter.net>在消息中写道

新闻:_y ******************** @ newsc.telia.net ...



也许有人可以帮我优化FIR
滤波器的这个C / C ++实现(尽管我意识到我应该考虑使用FFT方法。)

以下示例计算10000个抽头的FIR
滤波器的输出信号为10000滞后。在这个例子中,输入信号和滤波器系数是
只是垃圾。对于我对此过滤器的预期用法,不需要存储过滤器的状态。

我的问题是这个实现在我的笔记本电脑上的Cygwin中表现非常好,但在Linux中不太好,在SunOS中甚至更糟。我试图计算示例程序在
不同平台上消耗的大周期数(以秒为单位的执行时间乘以CPU时钟的频率,以MHz为单位):

Cygwin:448个Mcycles(216个Mcycles没有* tmp_output_ptr ++ = sum;)
Linux:1090个Mcycles(151个Mcycles没有* tmp_output_ptr ++ = sum;)
SunOS:2800个Mcycles(103个Mcycles没有* tmp_output_ptr ++ = sum;)

括号中的性能是在某一行被注释掉时获得的(即行* tmp_output_ptr ++ = sum;)。正如你可以看到
在不同的平台上有不同的行为!

SunOS程序在没有批评的情况下表现非常好。注意
103 Mcycles意味着内部for循环每次迭代只花费一个周期!但是当包含关键线时,它的表现非常糟糕,大约慢28倍!

作为比较,Cygwin程序消耗216个Mcycles,即每次迭代大约两个时钟周期在内循环中,没有关键行
,只有关键行的两倍左右。

有人可以帮我加速Linux和SunOS上的程序
平台?

问候,
约翰

#include< malloc.h>
int main()
{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =
(const int * const)malloc(nrof_taps * sizeof(int));
const int * const input_ptr =
(const int * const)malloc((nrof_taps-1 + nrof_lags)* sizeof(int));
int const * output_ptr =(int const *)malloc(nrof_lags * sizeof(int));
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag,点击;
tmp_output_ptr =(int *)output_ptr;
for(lag = 0;滞后< nrof_lags;滞后++)
{
tmp_coeff_ptr =(const int *)coeff_ptr;
tmp_input_ptr =(const int *)input_ptr + lag;
sum = 0;
for(tap = 0;点击< nrof_taps;点击++)
{
sum + = * tmp_coeff_ptr ++ * * tmp_input_ptr ++;
}
* tmp_output_ptr ++ = sum;
}
}



不知道问题是什么,但如果你注释掉了

* tmp_output_ptr ++ = sum;

编译器可能会优化掉在前一个陈述中计算总和,

可以解释业绩差异。尝试

1.优化代码一点点,而不是

sum = 0;

for(tap = 0; tap< nrof_taps; tap ++)

{

sum + = * tmp_coeff_ptr ++ * * tmp_input_ptr ++;

}

* tmp_output_ptr ++ = sum;



for(tap = 0; tap< nrof_taps; tap ++)

* tmp_output_ptr ++ + = * tmp_coeff_ptr ++ * * tmp_input_ptr ++;

(在每个外部迭代中清除总和,因此我假设它是'/ b $ b以后没有使用过的地方)

2.使用calloc()而不是malloc( )。我不确定什么是对齐

的含义(一些额外的数据访问指令通过

dereferenced

ptr in malloc''d块?)如果malloc()在任何特定平台上使用。


我希望发布正确的解释和解决方案,

当你找到它时。


Hi,

Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach
instead.)

The example below calculates the output signal for 10000 lags from an FIR
filter with 10000 taps. The input signal and the filter coefficients is just
rubbish in this example. For my intended usage of this filter, it is not
necessary to store the state of the filter.

My problem is that this implementation performs very well in Cygwin on my
laptop, but less good in Linux and even worse in SunOS. I have tried to
calculate the number of mega-cycles that the example program consumes on the
different platforms (execution time in seconds multiplied by CPU clock
frequency in MHz):

Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum;)
Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum;)
SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum;)

The performance within parentheses was obtained when a certain line was
commented out (i.e. the line *tmp_output_ptr++ = sum;). As you can see there
were very different behaviors on the different platforms!

The SunOS program performs extremely well without the critial line. Note
that the 103 Mcycles means that only one cycle was spent per iteration in
the inner for loop! But it performs terribly when the critical line is
included, about 28 times slower!

As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
clock cycles per iteration in the inner loop, without the critical line and
only about twice as much with the critical line.

Can someone help me to speed up the program on the Linux and SunOS
platforms?

Regards,
Johan
#include <malloc.h>
int main()
{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =
(const int * const) malloc(nrof_taps*sizeof(int));
const int * const input_ptr =
(const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag, tap;
tmp_output_ptr = (int *) output_ptr;
for (lag=0; lag<nrof_lags; lag++)
{
tmp_coeff_ptr = (const int *) coeff_ptr;
tmp_input_ptr = (const int *) input_ptr + lag;
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
}
}

解决方案

Johan Bergman wrote:

Hi,

Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach
instead.)

The example below calculates the output signal for 10000 lags from an FIR
filter with 10000 taps. The input signal and the filter coefficients is
just rubbish in this example. For my intended usage of this filter, it is
not necessary to store the state of the filter.

My problem is that this implementation performs very well in Cygwin on my
laptop, but less good in Linux and even worse in SunOS. I have tried to
calculate the number of mega-cycles that the example program consumes on
the different platforms (execution time in seconds multiplied by CPU clock
frequency in MHz):

Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum;)
Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum;)
SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum;)

The performance within parentheses was obtained when a certain line was
commented out (i.e. the line *tmp_output_ptr++ = sum;). As you can see
there were very different behaviors on the different platforms!

The SunOS program performs extremely well without the critial line. Note
that the 103 Mcycles means that only one cycle was spent per iteration in
the inner for loop! But it performs terribly when the critical line is
included, about 28 times slower!

As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
clock cycles per iteration in the inner loop, without the critical line
and only about twice as much with the critical line.

Can someone help me to speed up the program on the Linux and SunOS
platforms?

Regards,
Johan
#include <malloc.h>
int main()
{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =
(const int * const) malloc(nrof_taps*sizeof(int));
const int * const input_ptr =
(const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag, tap;
tmp_output_ptr = (int *) output_ptr;
for (lag=0; lag<nrof_lags; lag++)
{
tmp_coeff_ptr = (const int *) coeff_ptr;
tmp_input_ptr = (const int *) input_ptr + lag;
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
}
}


Assuming that you are using basically the same compiler and the same
optimization options in each case, you must be running on different
hardware. You''ve given no reason to believe that the OS is responsible.
If one or more of your platforms supports SSE/SSE2 instructions, and you
can change the operands to float/double, you should be able to get much
better performance. Split the sum into 4 or 8 partial sums, thus
parallelizing the sum reduction. This becomes OT if you use a C compiler
whichs supports parallel instructions only in asm.
--
Tim Prince



[Warning: long style critique ahead]

On Sat, 25 Oct 2003, Johan Bergman wrote:


Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach
instead.)
What in the world is an FIR filter? (Finite impulse-response, yes,
yes, I know, I looked it up. Point is, practically nobody in this
newsgroup will know, either, and they''ll *all* have to look it up
if they really want to help you. You''re making it harder for people
to help you when you don''t define your terms.)
My problem is that this implementation performs very well in Cygwin on my
laptop, but less good in Linux and even worse in SunOS. I have tried to
calculate the number of mega-cycles that the example program consumes on the
different platforms (execution time in seconds multiplied by CPU clock
frequency in MHz):

Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum;)
Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum;)
SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum;) Can someone help me to speed up the program on the Linux and SunOS
platforms?
Well, first let''s clean up your code and see what it looks like
then. Clean code is always easier to operate on. :-)

#include <malloc.h>
Non-standard header. ''malloc'' is declared in <stdlib.h>.

#include <stdlib.h>
int main()
[Some regulars strongly believe in ''int main(void)''. I
don''t have a strong preference either way, but it''s wise
to consider why you picked the style you did, even on
little things like this.]
{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =
Okay, yeah, ''const'' is great. But really, this is ridiculous.
You don''t think that all these consts are somehow making your
program more efficient, do you? (They''re not hurting, but
over-constifying is Bad mostly because it''s Ugly, and Ugly code
is hard to make Right.)
(const int * const) malloc(nrof_taps*sizeof(int));
malloc(nrof_taps * sizeof *coeff_ptr);

No chance for error using this idiom.
Also note that casting a malloc() result to ''const'' anything
is just silly, because free() takes a non-const pointer
argument. How are you going to free() this memory, hmm?

const int * const input_ptr =
(const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag, tap;
Way too many similarly-named variables, in my opinion.
Again, purely a stylistic issue, but again, clean code is
nice code. Note that in the cleaned-up version, I''ve pulled
in their scope to where they''re actually used. This may
actually help with the efficiency of the program, since some
optimizers look for "windows" in which objects can be moved
into registers and suchlike. The wider their scopes are, the
harder it is for the optimizer to see them.
tmp_output_ptr = (int *) output_ptr;
Casting away constness is Bad. Especially when you''ve just
finished jumping through hoops to make it const in the first
place. Geez.
for (lag=0; lag<nrof_lags; lag++)
{
tmp_coeff_ptr = (const int *) coeff_ptr;
tmp_input_ptr = (const int *) input_ptr + lag;
STOP PERFORMING RANDOM CASTS! For Pete''s sake!
Get a hold of yourself here!
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
This line (and the one below) are just begging to be
simplified. Let''s expand it so we can see the order
of operations.
}
*tmp_output_ptr++ = sum;
}
/* always */ return 0; /* from main */
}



Okay, let''s put it all together: no casts, clean names,
clean mallocs, and see what we get!

You see how all those tmp_*_ptr objects just melt away
once we start to see the outline of the program clearly.
In fact, we uncover a subtle bug in the memory allocation,
and fix it (''input'' was being malloc''ed one element short)!
Finally, we strive for idiomatic identifiers: using ''i'' and
''j'' for the nested loop controls, and losing the silly
''_ptr'' suffixes on the main data arrays.
#include <stdlib.h>

int main(void)
{
int nrof_lags = 10000; /* Could both of these be #defines? */
int nrof_taps = 10000; /* That would help a little. */

int *coeffs = malloc(nrof_taps * sizeof *coeffs);
int *input = malloc((nrof_taps+nrof_lags) * sizeof *input);
int *output = malloc(nrof_lags * sizeof *output);
int i, j;

for (i=0; i < nrof_lags; ++i)
{
int sum = 0;
for (j=0; j < nrof_taps; ++j)
sum += coeffs[j] * input[i+j];
output[i] = sum;
}

free(coeffs);
free(input);
free(output);
return 0;
}
There -- isn''t that a heck of a lot simpler and easier to
read? I bet it performs comparably fast on all your machines.
If not, you might try changing ''int'' to ''unsigned int'' (in
fact, you might do that anyway, to guard against overflow
conditions); reversing the order of either or both loops;
if many of the ''coeffs[j]'' are zero, try pre-processing
''coeffs'' a little bit; little tweaks like that.

I think most of your performance differences are due to
different cache layouts on the different machines, but I
could easily be wrong.

How''s the code do now?

HTH,
-Arthur


"Johan Bergman" <us**@inter.net> wrote in message
news:_y********************@newsc.telia.net...

Hi,

Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach
instead.)

The example below calculates the output signal for 10000 lags from an FIR
filter with 10000 taps. The input signal and the filter coefficients is just rubbish in this example. For my intended usage of this filter, it is not
necessary to store the state of the filter.

My problem is that this implementation performs very well in Cygwin on my
laptop, but less good in Linux and even worse in SunOS. I have tried to
calculate the number of mega-cycles that the example program consumes on the different platforms (execution time in seconds multiplied by CPU clock
frequency in MHz):

Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum;)
Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum;)
SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum;)

The performance within parentheses was obtained when a certain line was
commented out (i.e. the line *tmp_output_ptr++ = sum;). As you can see there were very different behaviors on the different platforms!

The SunOS program performs extremely well without the critial line. Note
that the 103 Mcycles means that only one cycle was spent per iteration in
the inner for loop! But it performs terribly when the critical line is
included, about 28 times slower!

As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
clock cycles per iteration in the inner loop, without the critical line and only about twice as much with the critical line.

Can someone help me to speed up the program on the Linux and SunOS
platforms?

Regards,
Johan
#include <malloc.h>
int main()
{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =
(const int * const) malloc(nrof_taps*sizeof(int));
const int * const input_ptr =
(const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag, tap;
tmp_output_ptr = (int *) output_ptr;
for (lag=0; lag<nrof_lags; lag++)
{
tmp_coeff_ptr = (const int *) coeff_ptr;
tmp_input_ptr = (const int *) input_ptr + lag;
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
}
}


No idea what the problem is, but if you comment out
*tmp_output_ptr++ = sum;
compiler may optimize away calculation of sum in previous statement,
which could explain performance differences. Try
1. optimize code little bit, instead of
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
do
for (tap=0; tap<nrof_taps; tap++)
*tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
(sum is cleared in each outer iteration and therefore I assume it''s
not used somewhere later)
2. use calloc() instead of malloc(). I''m not sure what are alignment
implications (some additional instructions for acces to data via
dereferenced
ptr in malloc''d block?) in case of malloc() use on any particular platform.

I would appreciate posting of proper explanation and solution,
when you find it.


这篇关于这是所有平台上的最佳FIR滤波器吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆