For XMM/YMM FP operation on Intel Haswell, can FMA be used in place of ADD?

Views: 164 | Posted: 2020/7/22 23:45:47 | Tags: sse avx throughput flops fma


Question

This question is about packed, single-precision floating-point operations with XMM/YMM registers on Haswell.

According to the awesome, awesome instruction tables put together by Agner Fog, MUL can execute on either port p0 or p1 (reciprocal throughput of 0.5), while ADD executes only on port p1 (reciprocal throughput of 1). I can accept this limitation, but I also know that FMA can execute on either port p0 or p1 (reciprocal throughput of 0.5). So it confuses me that a plain ADD is limited to p1 when FMA, which does both an ADD and a MUL, can use either p0 or p1. Am I misunderstanding the table? Or can someone explain why that would be?

That is, if my reading is correct, why wouldn't Intel just use the FMA unit as the basis for both plain MUL and plain ADD, thereby increasing the throughput of ADD as well as MUL? Alternatively, what would stop me from using two simultaneous, independent FMA ops to emulate two simultaneous, independent ADD ops? What are the penalties associated with doing ADD-by-FMA? Obviously, more registers are used (2 for ADD vs. 3 for ADD-by-FMA), but other than that?

Recommended answer

You're not the only one confused as to why Intel did this. In his microarchitecture manual, Agner Fog writes about Haswell:

"It is strange that there is only one port for floating point addition, but two ports for floating point multiplication."

On his message board he also writes:

"There are two execution units for floating point multiplication and for fused multiply-and-add, but only one execution unit for floating point addition. This design appears to be suboptimal since floating point code typically contains more additions than multiplications."

That thread continues with more information on the subject, which I suggest you read but won't quote here.

He also discusses it in his answer to the question flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2:

"The latency of FMA instructions on Haswell is 5 and the throughput is 2 per clock. This means that you must keep 10 parallel operations going to get the maximum throughput. If, for example, you want to add a very long list of f.p. numbers, you would have to split it in ten parts and use ten accumulator registers."

"This is possible indeed, but who would make such a weird optimization for one specific processor?"
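
The figure of ten in the quote follows from latency times throughput: 5 cycles of latency multiplied by 2 operations issued per cycle means 10 independent operations must be in flight at once. As a rough illustration of what "split it in ten parts" means, here is a minimal sketch in portable scalar C. The function name `sum_ten_accumulators` and the macro `NACC` are made up for this example, and whether the compiler actually keeps the partial sums in registers (or vectorizes the loop) depends on the compiler and its flags.

```c
#include <stddef.h>

/* Hypothetical constant for this sketch: 5-cycle latency x 2 ops per cycle,
   using the Haswell figures from the quote above. */
#define NACC 10

/* Sums n floats using NACC independent partial sums, so that consecutive
   additions do not have to wait for each other's results. */
float sum_ten_accumulators(const float *a, size_t n)
{
    float acc[NACC] = {0.0f};
    size_t i = 0;

    /* Main loop: each partial sum forms its own dependency chain. */
    for (; i + NACC <= n; i += NACC)
        for (size_t k = 0; k < NACC; ++k)
            acc[k] += a[i + k];

    /* Combine the partial sums, then add the leftover elements. */
    float total = 0.0f;
    for (size_t k = 0; k < NACC; ++k)
        total += acc[k];
    for (; i < n; ++i)
        total += a[i];
    return total;
}
```

Because the additions are performed in a different order than in a simple left-to-right loop, the rounded result can differ slightly from a sequential sum for general floating-point input.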

That answer basically answers your question: you can use FMA to double the throughput of addition. In fact, I do this in my throughput tests for addition and do see the throughput double.

In summary, for addition: if your calculation is latency bound, don't use FMA; use ADD, since on Haswell FMA has the higher latency (5 cycles versus 3 for ADD). But if it is throughput bound, you can try using FMA (by setting the multiplier to 1.0), although you will probably have to use many AVX registers to do this.

I unrolled 10 times to get maximum throughput here: loop-unrolling-to-achieve-maximum-throughput-with-ivy-bridge-and-haswell
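
For illustration, ADD-by-FMA with a multiplier of 1.0 and ten YMM accumulators could look roughly like the sketch below. This is not the code from the linked post; the function name `sum_add_by_fma` and the macro `NACC` are invented for this example. The FMA path is compiled only when the compiler defines `__AVX__` and `__FMA__` (for example with `-mfma` on GCC or Clang); otherwise the function falls back to a plain scalar loop. A compiler is free to simplify a multiply by 1.0, so it is worth checking the generated assembly for `vfmadd` instructions before drawing conclusions from timings.

```c
#include <stddef.h>
#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* Hypothetical constant for this sketch: number of independent accumulators. */
#define NACC 10

/* Sums n floats. With AVX and FMA enabled, each accumulation step is written
   as 1.0 * x + acc so that it can issue as an FMA instead of an ADD. */
float sum_add_by_fma(const float *a, size_t n)
{
    size_t i = 0;
    float total = 0.0f;

#if defined(__AVX__) && defined(__FMA__)
    const __m256 ones = _mm256_set1_ps(1.0f);
    __m256 acc[NACC];
    for (int k = 0; k < NACC; ++k)
        acc[k] = _mm256_setzero_ps();

    /* Each iteration consumes NACC vectors of 8 floats, one per accumulator. */
    for (; i + 8 * NACC <= n; i += 8 * NACC)
        for (int k = 0; k < NACC; ++k)
            acc[k] = _mm256_fmadd_ps(ones, _mm256_loadu_ps(a + i + 8 * k), acc[k]);

    /* Reduce the accumulators to one vector, then to a scalar. */
    __m256 v = acc[0];
    for (int k = 1; k < NACC; ++k)
        v = _mm256_add_ps(v, acc[k]);
    float lanes[8];
    _mm256_storeu_ps(lanes, v);
    for (int k = 0; k < 8; ++k)
        total += lanes[k];
#endif

    /* Scalar tail, and the complete fallback when FMA is not enabled. */
    for (; i < n; ++i)
        total += a[i];
    return total;
}
```

Multiplying by 1.0 is exact, so each FMA step rounds only once and produces the same value a plain ADD would; the cost is the extra register holding the constant and the longer dependency chain latency mentioned in the summary.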
