Aarch64 what is late-forwarding?


Question

"Late-forwarding" is mentioned in "Arm Neoverse E1 Core Software Optimization Guide" (as well as in their optimization guides for some other CPU models):

| Instruction Group | Instructions | Exec Latency | Exec Throughput | Notes |
|---|---|---|---|---|
| Multiply accumulate (32-bit) | MADD, MSUB | 3 (2) | 1 | 2 |
| Multiply accumulate (64-bit) | MADD, MSUB | 5 (4) | 1/3 | 2 |

(2) Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar μOPs, allowing a typical sequence of multiply-accumulate μOPs to issue one every N cycles (accumulate latency N shown in parentheses).

What does the term "late-forwarding" mean? Which instruction sequences would be subject to late-forwarding? (A counter-example would also be helpful.)

Answer

Late forwarding for multiply-add operations means that the addend only has to be available once the multiplication has completed, rather than when the multiply-add operation begins execution. Since the multiplication itself has no data dependence on the addend, it can proceed without it. In a floating-point multiply-add, some of the work for the addition can be done in parallel with the multiplication (the exponent of the product is available early and can be used with the addend's exponent to determine the amount of shift needed before the addition), so one may want the addend to be available before the entire product is. Even in that case, the addend is not needed until much later than the multiplicands.
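To make the distinction concrete, here is a minimal C sketch of the two kinds of dependency chain. The function names `dot_product` and `horner` are made up for illustration, and whether a compiler actually emits scalar `MADD` for these loops (rather than, say, vectorizing them) depends on the compiler and flags, so treat the instruction comments as an assumption. In the first loop each result feeds only the accumulate operand of the next multiply-add, which is the kind of sequence the guide's note describes. In the second, the previous result feeds a multiply input, so the next multiply cannot start until the full result is available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Chain through the accumulate operand: acc is only the addend of each
 * multiply-add (conceptually: madd w0, w1, w2, w0), so the next operation
 * needs the previous result late, not at the start of its multiply. */
uint32_t dot_product(const uint32_t *a, const uint32_t *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Counter-example (Horner's rule): acc is a multiply input of the next
 * multiply-add (conceptually: madd w0, w0, w1, w2), so the next operation
 * needs the previous result before its multiply can begin. */
uint32_t horner(const uint32_t *coeffs, size_t n, uint32_t x)
{
    assert(n == 0 || coeffs != NULL);
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = acc * x + coeffs[i];
    return acc;
}
```

Going by the table in the question, one would expect the first chain to be limited by the accumulate latency (the number in parentheses) and the second by the full execution latency.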

Because the addend can be forwarded (made available) late, the effective latency of dependent accumulations is reduced. This in turn reduces the number of accumulation registers, and the amount of independent parallelism, that one needs to cover the latency.
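The effect on latency and on the number of accumulators can be estimated with a very simple model. This is a back-of-the-envelope sketch, not a description of the real pipeline: `chain_cycles` and `accumulators_needed` are hypothetical helpers, and the model assumes that each operation in a dependent chain starts a fixed number of cycles after the previous one and that nothing else stalls.

```c
#include <assert.h>

/* Cycles until the last result of a chain of n dependent multiply-adds is
 * available. Each operation starts dep_latency cycles after the previous
 * one, and the final operation takes exec_latency cycles to complete. */
unsigned chain_cycles(unsigned n, unsigned exec_latency, unsigned dep_latency)
{
    assert(exec_latency >= dep_latency);
    if (n == 0)
        return 0;
    return (n - 1) * dep_latency + exec_latency;
}

/* Independent accumulators needed to keep the multiply pipeline busy:
 * ceil(dep_latency / issue_interval), where issue_interval is the number of
 * cycles between issues allowed by the throughput (1 for a throughput of 1,
 * 3 for a throughput of 1/3). */
unsigned accumulators_needed(unsigned dep_latency, unsigned issue_interval)
{
    assert(issue_interval > 0);
    return (dep_latency + issue_interval - 1) / issue_interval;
}
```

Plugging in the 32-bit row of the table (execution latency 3, accumulate latency 2, throughput 1), `chain_cycles(8, 3, 2)` gives 17 cycles for a chain through the accumulate operand versus `chain_cycles(8, 3, 3)` = 24 for a chain through a multiply input, and `accumulators_needed` drops from 3 to 2.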
