Multiplications by powers of 2


Problem description

I am writing a program (precisely an image processing filter) which computes
several multiplications of floating-point numbers by constant powers of 2.
This means adding an integer constant to the exponent without changing the
mantissa, and I believed that this was simpler than multiplying two
floating-point numbers. Instead, if I write the program using the C "ldexp"
function, it runs much slower and produces the same output.

Since the program should run on a Pentium 4 processor, I downloaded a manual
from Intel's website
http://www.intel.com/design/pentium4/manuals/248966.htm
and indeed floating-point multiplication ("fmul" in assembler) has a latency
of 8 clock cycles, while changing the exponent ("fscale" in assembler) has a
latency of 60 cycles. This makes no sense to me.

I would like to ask if anyone knows a fast method for multiplying a
floating-point number by a constant power of 2, or if I should use standard
multiplication. I must use floating-point arithmetic because the input data
can span several orders of magnitude and the program also computes square
roots and exponentials. Speed is important, because the program should
process high-resolution video.

I tried to add an integer to the exponent using a union, but this is still
a bit slower than pure floating-point multiplication. I guess this happens
because the processor uses different registers for integer and
floating-point operations, therefore some clock cycles are wasted in moving
data around. Also in this way underflows give incorrect results, e.g.
"multiplying" 0.0 by 2^(-1) gives -Inf. Maybe I could gain some speed by
using SSE instructions, which (as far as I know) use the same set of
registers for both integer and floating-point operations, but this will not
solve the underflow problem.

I hope someone could help me. Thanks in advance.

An Electronics Engineering PhD student

Recommended answer

Lisa Simpson wrote:
> I would like to ask if anyone knows a fast method for multiplying a
> floating-point number by a constant power of 2,
> or if I should use standard multiplication.

Use standard multiplication.
If one of the operands of the multiplication operator
is a constant expression with a value of 2,
then your compiler has all the information it needs
to do any optimization that there is to be done.

--
pete
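
As a minimal sketch of that advice (the function name `scale_samples` and the factor 2^-3 are made up for illustration), the scaling can be written as a plain multiplication by a constant. A power of two such as 0.125f is exactly representable, so the product is exact unless the result overflows or falls into the subnormal range, and zero stays zero.

```c
#include <stddef.h>

/* Hypothetical helper: scale every sample by 2^-3 using an ordinary
 * floating-point multiply and leave any optimization to the compiler. */
static void scale_samples(float *samples, size_t n)
{
    const float factor = 0.125f;      /* 2^-3, exactly representable */
    for (size_t i = 0; i < n; ++i)
        samples[i] *= factor;
}
```

Calling `scale_samples` on an array holding 8.0f, 0.0f and -4.0f leaves 1.0f, 0.0f and -0.5f in place.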


Lisa Simpson wrote:
> I am writing a program (precisely an image processing filter) which computes
> several multiplications of floating-point numbers by constant powers of 2.
> This means adding an integer constant to the exponent without changing the
> mantissa, and I believed that this was simpler than multiplying two
> floating-point numbers. Instead, if I write the program using the C "ldexp"
> function, it runs much slower and produces the same output.
>
> Since the program should run on a Pentium 4 processor, I downloaded a manual
> from Intel's website
> http://www.intel.com/design/pentium4/manuals/248966.htm
> and indeed floating-point multiplication ("fmul" in assembler) has a latency
> of 8 clock cycles, while changing the exponent ("fscale" in assembler) has a
> latency of 60 cycles. This makes no sense to me.

Why not? Common operations are optimized; uncommon ones are not. Just
because you think one operation ought to be faster because it's conceptually
simpler doesn't mean that will be the case.

> I would like to ask if anyone knows a fast method for multiplying a
> floating-point number by a constant power of 2, or if I should use standard
> multiplication.

You should use multiplication, since you're trying to multiply. :-)

Leave optimization to the compiler. If timing shows your program isn't
fast enough for your purposes, use profiling to determine where the
bottleneck is, and find ways to remove it. The bottlenecks are almost never
primitive operations, but rather the algorithm that uses them.

> I must use floating-point arithmetic because the input data
> can span several orders of magnitude and the program also computes square
> roots and exponentials. Speed is important, because the program should
> process high-resolution video.

"Speed is important" is not the same as "I know how important the speed is,
what the current speed is, and where the speed isn't sufficient".

Speed is almost never *unimportant*, but that doesn't mean you should
optimize prematurely.

> I tried to add an integer to the exponent using a union, but this is still
> a bit slower than pure floating-point multiplication. I guess this happens
> because the processor uses different registers for integer and
> floating-point operations, therefore some clock cycles are wasted in moving
> data around. Also in this way underflows give incorrect results, e.g.
> "multiplying" 0.0 by 2^(-1) gives -Inf. Maybe I could gain some speed by
> using SSE instructions, which (as far as I know) use the same set of
> registers for both integer and floating-point operations, but this will not
> solve the underflow problem.




SSE may very well speed up things, but not for the reasons you mention. SSE
is designed to quickly apply a single operation to multiple sets of data.
Rather than multiply two numbers, SSE could be used to multiply vectors of
numbers, effectively doing many multiplications in concert. Modern compilers
can even make use of this transparently in some circumstances.

How to use SSE and comparable vectorizing technologies is off-topic for this
newsgroup, but there's plenty of information out there. This seems much more
relevant to your problem than trying to optimize multiplication.

S.
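
To give a rough idea of what "multiplying vectors of numbers" looks like, here is a small sketch using the SSE intrinsics from `<xmmintrin.h>`. It assumes an x86 compiler with SSE enabled, and the function name `scale_samples_sse` is made up for illustration. Four floats are multiplied per instruction, and a scalar loop handles any leftover elements.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics; x86 targets only */

/* Hypothetical helper: multiply n floats by the same factor, four at a time. */
static void scale_samples_sse(float *samples, size_t n, float factor)
{
    const __m128 vfactor = _mm_set1_ps(factor);   /* factor in all 4 lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(samples + i);     /* unaligned load of 4 floats */
        _mm_storeu_ps(samples + i, _mm_mul_ps(v, vfactor));
    }
    for (; i < n; ++i)                            /* scalar tail */
        samples[i] *= factor;
}
```

An optimizing compiler may generate similar code on its own from a plain loop, so it is worth timing both versions before keeping the hand-written one.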


"Lisa Simpson" wrote:
> I am writing a program (precisely an image processing filter) which computes
> several multiplications of floating-point numbers by constant powers of 2.
> This means adding an integer constant to the exponent without changing the
> mantissa, and I believed that this was simpler than multiplying two
> floating-point numbers. Instead, if I write the program using the C "ldexp"
> function, it runs much slower and produces the same output.
>
> Since the program should run on a Pentium 4 processor, I downloaded a manual
> from Intel's website
> http://www.intel.com/design/pentium4/manuals/248966.htm
> and indeed floating-point multiplication ("fmul" in assembler) has a latency
> of 8 clock cycles, while changing the exponent ("fscale" in assembler) has a
> latency of 60 cycles. This makes no sense to me.
>
> I would like to ask if anyone knows a fast method for multiplying a
> floating-point number by a constant power of 2, or if I should use standard
> multiplication. I must use floating-point arithmetic because the input data
> can span several orders of magnitude and the program also computes square
> roots and exponentials. Speed is important, because the program should
> process high-resolution video.







Two very generic, very vague, and probably useless suggestions:

(a) Could you perform a chain of calculations knowing what error is
accumulating and correct the result only once at the end?

(b) Maybe you could get faster results using fixed point / multiple
precision arithmetic, and converting to floating point only when
needed. (Doubtful, but worth checking nevertheless.)

