Performance comparison of FPU with software emulation


Problem Description



While I know (or so I have been told) that floating-point coprocessors work faster than any software implementation of floating-point arithmetic, I totally lack a gut feeling for how large this difference is, in orders of magnitude.

The answer probably depends on the application and on where on the spectrum between microprocessors and supercomputers you are working. I am particularly interested in computer simulations.

Can you point out articles or papers for this question?

Solution

A general answer will obviously be very vague, because performance depends on so many factors.

However, based on my understanding, in processors that do not implement floating point (FP) operations in hardware, a software implementation will typically be 10 to 100 times slower (or even worse, if the implementation is bad) than integer operations, which are always implemented in hardware on CPUs.

The exact performance will depend on a number of factors, such as the features of the integer hardware - some CPUs lack an FPU, but have features in their integer arithmetic that help implement a fast software emulation of FP calculations.
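To make the cost concrete, here is a minimal sketch of my own (not taken from any particular soft-float library) of a single-precision multiply built from integer operations only. It handles normal numbers, truncates instead of rounding, and ignores NaN/infinity/denormals and exponent overflow, yet still needs a fair number of integer instructions plus a widening 32x32->64-bit multiply - exactly the kind of integer feature that helps emulation. A real soft-float routine does considerably more work, which is where the dozens of cycles go:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Multiply two IEEE-754 single-precision numbers using only integer
 * operations.  Simplified sketch: normal numbers only, no NaN/Inf,
 * no denormals, no overflow handling, truncation instead of rounding. */
static float soft_fmul(float fa, float fb)
{
    uint32_t a, b;
    memcpy(&a, &fa, sizeof a);            /* reinterpret the bit patterns   */
    memcpy(&b, &fb, sizeof b);

    uint32_t sign = (a ^ b) & 0x80000000u;             /* result sign       */
    int32_t  exp  = (int32_t)((a >> 23) & 0xFF)        /* add exponents,    */
                  + (int32_t)((b >> 23) & 0xFF) - 127; /* remove one bias   */

    uint64_t ma = (a & 0x007FFFFFu) | 0x00800000u;     /* significands with */
    uint64_t mb = (b & 0x007FFFFFu) | 0x00800000u;     /* implicit leading 1*/

    uint64_t prod = ma * mb;              /* 24x24 -> up to 48-bit product  */
    if (prod & (1ull << 47)) {            /* normalize: product in [2,4)    */
        prod >>= 24;
        exp   += 1;
    } else {                              /* product in [1,2)               */
        prod >>= 23;
    }

    uint32_t r = sign | ((uint32_t)exp << 23) | ((uint32_t)prod & 0x007FFFFFu);
    float fr;
    memcpy(&fr, &r, sizeof fr);
    return fr;
}

int main(void)
{
    printf("%g\n", soft_fmul(1.5f, -2.25f));   /* prints -3.375 */
    return 0;
}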

The paper mentioned by njuffa (Cristina Iordache and Ping Tak Peter Tang, "An Overview of Floating-Point Support and Math Library on the Intel XScale Architecture") supports this. For the Intel XScale processor it lists the following latencies (excerpt):

integer addition or subtraction:  1 cycle
integer multiplication:           2-6 cycles
fp addition (emulated):           34 cycles
fp multiplication (emulated):     35 cycles

So this would result in a factor of about 10-30 between integer and FP arithmetic. The paper also mentions that the GNU implementation (the one the GNU compiler uses by default) is about 10 times slower, which is a total factor of 100-300.
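The exact factor varies with the CPU and the emulation library, so for a concrete target it is easy to measure rather than estimate. Below is a rough micro-benchmark sketch of my own (not from the paper); build it with your toolchain's software floating-point option, and, if the target also has an FPU, once more with hardware floating point, then compare the reported ratios:

#include <stdio.h>
#include <time.h>

#define N 100000000UL   /* reduce on slow targets */

/* Crude throughput comparison: N dependent integer multiplies vs. N
 * dependent float multiplies.  The volatile accumulators keep the
 * compiler from optimizing the loops away.  Results are only a rough
 * indication, not a rigorous benchmark. */
int main(void)
{
    volatile unsigned long iacc = 1;
    volatile float         facc = 1.0f;
    unsigned long i;
    clock_t t0, t1, t2;

    t0 = clock();
    for (i = 0; i < N; i++) iacc = iacc * 3u + 1u;    /* integer work            */
    t1 = clock();
    for (i = 0; i < N; i++) facc = facc * 0.9999999f; /* FP work (stays normal)  */
    t2 = clock();

    {
        double ti = (double)(t1 - t0) / CLOCKS_PER_SEC;
        double tf = (double)(t2 - t1) / CLOCKS_PER_SEC;
        printf("integer: %.2fs  float: %.2fs  FP/int ratio: %.1f\n",
               ti, tf, ti > 0.0 ? tf / ti : 0.0);
    }
    return 0;
}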

Finally, note that the above is for the case where the FP emulation is compiled into the program by the compiler. Some operating systems (e.g. Linux and Windows CE) also have an FP emulation in the OS kernel. The advantage is that even code compiled without FP emulation (i.e. using FPU instructions) can run on a processor without an FPU - the kernel will transparently emulate unsupported FPU instructions in software. However, this emulation is even slower (about another factor of 10) than a software emulation compiled into the program, because of the additional overhead. Obviously, this case is only relevant on processor architectures where some processors have an FPU and some do not (such as x86 and ARM).
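As an illustration of the two schemes, on ARM with GCC the choice is made when the program is built. A small sketch follows; the toolchain prefix is an assumption, while -mfloat-abi and __aeabi_fmul are standard GCC/ARM-EABI names:

/* demo.c - the same source can be built for either scheme:
 *
 *   arm-linux-gnueabi-gcc -mfloat-abi=soft demo.c
 *       -> compiled-in emulation: the multiply below becomes a call to a
 *          runtime routine such as __aeabi_fmul
 *
 *   arm-linux-gnueabi-gcc -mfloat-abi=hard demo.c
 *       -> real FPU instructions: on a core without an FPU each of them
 *          traps, and the kernel (if configured) emulates it - the slower
 *          scheme described above
 */
#include <stdio.h>

int main(void)
{
    volatile float a = 1.5f, b = 2.25f;   /* volatile keeps the multiply alive */
    printf("%f\n", (double)(a * b));
    return 0;
}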

Note: This answer compares the performance of (emulated) FP operations with integer operations on the same processor. Your question might also be read as asking about the performance of (emulated) FP operations compared to hardware FP operations (I am not sure which you meant). However, the result would be about the same, because if FP is implemented in hardware, it is typically (almost) as fast as integer operations.

