How to chain multiple fma operations together for performance?


Problem description



Assuming that in some C or C++ code I have a function named T fma(T a, T b, T c) that performs 1 multiplication and 1 addition like so: (a * b) + c, how am I supposed to optimize multiple mul & add steps?

For example, my algorithm needs to be implemented with 3 or 4 fma operations chained and summed together. How can I write this in an efficient way, and which part of the syntax or semantics should I pay particular attention to?

I would also like some hints on the critical part: avoiding changes to the CPU's rounding mode, so that the CPU pipeline is not flushed. I'm quite sure that just using the + operation between multiple calls to fma shouldn't change it. I say "quite sure" because I don't have many CPUs to test this on; I'm just following some logical steps.

My algorithm is something like the sum of multiple fma calls:

fma(triplet 1) + fma(triplet 2) + fma(triplet 3)

Solution

Recently, at Build 2014, Eric Brumer gave a very nice talk on the topic. The bottom line of the talk was that:

Using Fused Multiply Accumulate (aka FMA) everywhere hurts performance.

On Intel CPUs an FMA instruction costs 5 cycles. Doing a separate multiplication (5 cycles) and addition (3 cycles) instead costs 8 cycles. Using FMA, you are getting two operations for the price of one.

However, FMA is not the holy grail of instructions. As the talk illustrates, FMA can in certain situations hurt performance.

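In the same fashion, your case fma(triplet 1) + fma(triplet 2) + fma(triplet 3) costs 21 cycles (three FMAs at 5 cycles each plus two additions at 3 cycles each), whereas the same operations without FMA would cost 30 cycles (three multiply-add pairs at 8 cycles each plus the same two additions). That's 30% fewer cycles.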
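The arithmetic above can be sketched in C++ as follows. This is only an illustration: the helper name sum_of_fmas and the constant names are made up for this example, and the cycle figures are the ones quoted in the answer, added up as if every operation ran strictly one after another (real CPUs overlap independent operations, so measured numbers will differ). Note that neither std::fma nor + changes the rounding mode; only an explicit call such as std::fesetround does.

```cpp
#include <cmath>

// Hypothetical helper: the sum of three fused multiply-adds, written the way
// the question states it. Each std::fma rounds once; each '+' rounds once more.
double sum_of_fmas(const double a[3], const double b[3], const double c[3]) {
    return std::fma(a[0], b[0], c[0])
         + std::fma(a[1], b[1], c[1])
         + std::fma(a[2], b[2], c[2]);
}

// Cost model taken from the answer: per-instruction cycles, summed serially.
constexpr int kFmaCycles = 5;
constexpr int kMulCycles = 5;
constexpr int kAddCycles = 3;

// Three FMAs plus the two additions that join them.
constexpr int cycles_with_fma = 3 * kFmaCycles + 2 * kAddCycles;

// Three separate multiply + add pairs plus the same two joining additions.
constexpr int cycles_without_fma =
    3 * (kMulCycles + kAddCycles) + 2 * kAddCycles;

static_assert(cycles_with_fma == 21, "matches the answer's 21 cycles");
static_assert(cycles_without_fma == 30, "matches the answer's 30 cycles");
```

With a = {1, 2, 3}, b = {4, 5, 6}, c = {7, 8, 9}, every intermediate value is a small integer, so sum_of_fmas returns exactly 56.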

Explicitly requesting FMA instructions in your code would typically mean using compiler intrinsics (or the standard fma function from <cmath> / <math.h>). In my humble opinion though, FMA etc. is not something you should be worried about, unless you are a C++ compiler programmer. If you are not, let the compiler's optimizer take care of these technicalities. Generally, concerns of this kind are where the root of all evil (i.e., premature optimization) lies, to paraphrase one of the great ones (i.e., Donald Knuth).
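As a rough sketch of what "letting the compiler take care of it" can look like, the first function below uses plain a * b + c expressions and leaves any fusing to the compiler. Whether fusing actually happens depends on the target and on the floating-point contraction settings (for GCC and Clang, options such as -mfma or -march=native together with -ffp-contract are the relevant ones; check your compiler's documentation). The second function is a hypothetical alternative that folds everything into one dependent chain of std::fma calls. The names sum_plain and sum_chained are made up for this example.

```cpp
#include <cmath>

// Plain expression: the compiler may, subject to its contraction settings,
// turn each a*b + c into a single FMA instruction.
double sum_plain(const double a[3], const double b[3], const double c[3]) {
    return (a[0] * b[0] + c[0])
         + (a[1] * b[1] + c[1])
         + (a[2] * b[2] + c[2]);
}

// Hypothetical alternative: add the three c terms first, then accumulate the
// products through three chained FMAs. This reassociates the sum, so for
// general inputs the result may differ from sum_plain in the last bits.
double sum_chained(const double a[3], const double b[3], const double c[3]) {
    double acc = c[0] + c[1] + c[2];
    acc = std::fma(a[0], b[0], acc);
    acc = std::fma(a[1], b[1], acc);
    acc = std::fma(a[2], b[2], acc);
    return acc;
}
```

In the chained form each FMA has to wait for the previous one, whereas the three products in sum_plain are independent of each other. Which version is faster on a given CPU and compiler is something to measure rather than assume.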
