Loop optimization. How does register renaming break dependencies? What is execution port capacity?


Problem description



I am analyzing an example of a loop from Agner Fog's optimization_assembly, chapter 12.9. The code is (I simplified it a bit):

L1: 
    vmulpd ymm1, ymm2, [rsi+rax] 
    vaddpd ymm1, ymm1, [rdi+rax] 
    vmovupd [rdi+rax], ymm1
    add rax, 32  
    jl L1   

I have some questions:

  1. The author said that there is no loop-carried dependency, and I don't understand why. (I skipped the case of add rax, 32; it is indeed loop-carried, but only one cycle.) But, after all, the next iteration cannot modify the ymm1 register before the previous iteration has finished. Maybe register renaming plays a role here?

  2. Let's assume that there is a loop-carried dependency: vaddpd ymm1, ymm1, [rdi+rax] -> vmovupd [rdi+rax], ymm1

Let the latency of the first be 3, and the latency of the second be 7.

(In fact, there is no such dependency, but I would like to ask a hypothetical question.)

Now, how do I determine the total latency? Should I add the latencies, so the result would be 10? I have no idea.

  3. It is written:

There are two 256-bit read operations, each using a read port for two consecutive clock cycles, which is indicated as 1+ in the table. Using both read ports (port 2 and 3), we will have a throughput of two 256-bit reads in two clock cycles. One of the read ports will make an address calculation for the write in the second clock cycle. The write port (port 4) is occupied for two clock cycles by the 256-bit write. The limiting factor will be the read and write operations, using the two read ports and the write port at their maximum capacity.

What exactly is capacity for ports? How can I determine them, for example for IvyBridge (my CPU).

Solution

  1. Yes, the whole point of register renaming is to break dependency chains when an instruction writes a register without depending on the old value. The destination of a mov, or the write-only destination operand of AVX instructions, is like this. Also zeroing idioms like xor eax,eax are recognized as independent of the old value, even though they appear to have the old value as an input.

    See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for a more detailed description of register-renaming, and some performance experiments with multiple loop-carried dependency chains in flight at once.
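The mechanism can be illustrated with a toy model (a hypothetical sketch for intuition, not any real CPU's implementation): every architectural write allocates a fresh physical register, so WAR/WAW hazards vanish and only true read-after-write dependencies survive.

```python
from itertools import count

# Toy register-renaming model (illustrative only, not a real CPU):
# each architectural write gets a fresh physical register, so
# WAR/WAW hazards disappear and only RAW dependencies remain.
class Renamer:
    def __init__(self):
        self.free = count()   # unbounded supply of physical registers
        self.alias = {}       # architectural name -> physical register

    def read(self, arch):
        # A read uses the current mapping: RAW dependencies are kept.
        if arch not in self.alias:
            self.alias[arch] = next(self.free)
        return self.alias[arch]

    def write(self, arch):
        # A write allocates a new physical register: WAR/WAW broken.
        self.alias[arch] = next(self.free)
        return self.alias[arch]

r = Renamer()
mul1 = r.write("ymm1")            # iter 1: vmulpd writes ymm1
add_src = r.read("ymm1")          # iter 1: vaddpd reads it (RAW kept)
add1 = r.write("ymm1")            # ...and rewrites it
mul2 = r.write("ymm1")            # iter 2: vmulpd writes ymm1 again

assert add_src == mul1            # the add depends on the mul
assert mul2 not in (mul1, add1)   # iteration 2's write is independent
```

Iteration 2's multiply lands in a brand-new physical register, which is why the loop carries no dependency through ymm1.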

  2. Without renaming, vmulpd couldn't write ymm1 until vmovupd had read its operand (Write-After-Read hazard), but it wouldn't have to wait for vmovupd to complete. See a computer architecture textbook to learn about in-order pipelines and stuff. I'm not sure if any out-of-order CPUs without register renaming exist.

    update: early OoO CPUs used scoreboarding to do some limited out-of-order execution without register renaming, but were much more limited in their capacity to find and exploit instruction-level parallelism.

  3. Each of the two load ports on IvB has a capacity of one 128b load per clock. And also of one address-generation per clock.

    In theory, SnB/IvB can sustain a throughput of 2x 128b load and 1x 128b store per clock, but only by using 256b instructions. They can only generate two addresses per clock, but a 256b load or store only needs one address calculation per 2 cycles of data transfer. See Agner Fog's microarch guide
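Those figures can be turned into a back-of-the-envelope estimate for the loop above (a sketch using only the numbers quoted in this answer, not a pipeline simulator):

```python
# Back-of-the-envelope port-occupancy estimate for the loop on
# SnB/IvB, using the figures quoted above: each 256b load occupies
# a 128b-wide read port for 2 cycles, and the 256b store occupies
# the store-data port (port 4) for 2 cycles.
READ_PORTS = 2              # ports 2 and 3
loads_per_iter = 2          # vmulpd and vaddpd memory operands
stores_per_iter = 1         # vmovupd

read_busy = loads_per_iter * 2 / READ_PORTS   # cycles/iter on reads
store_busy = stores_per_iter * 2              # cycles/iter on port 4

cycles_per_iter = max(read_busy, store_busy)
print(cycles_per_iter)      # 2.0: read and store ports both saturated
```

Both limits come out at 2 cycles per iteration, matching Agner Fog's conclusion that the read and write ports run at maximum capacity.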

    Haswell added a dedicated store AGU on port 7 that handles simple addressing modes only, and widened the data paths to 256b. A single cycle can do a peak of 96 bytes total loaded + stored. (But some unknown bottleneck limits sustained throughput to less than that. On Skylake-client, about 84 bytes / cycle reported by Intel, and matches my testing.)

    (IceLake client reportedly can sustain 2x64B loaded + 1x64B stored per cycle, or 2x32B stored, according to a recent update to Intel's optimization guide.)


Also note that your indexed addressing modes won't micro-fuse, so fused-domain uop throughput is also a concern.
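As a rough illustration of that front-end concern, here is my own uop count, under the assumptions (mine, not from the text above) that the indexed addressing un-laminates all three memory operations on SnB/IvB and that add/jl macro-fuse:

```python
# Rough fused-domain uop count per iteration on SnB/IvB.
# Assumptions: indexed addressing makes the load and store uops
# un-laminate into separate fused-domain uops; add+jl macro-fuse.
uops_per_iter = {
    "vmulpd  (mul + separate load uop)": 2,
    "vaddpd  (add + separate load uop)": 2,
    "vmovupd (store-address + store-data)": 2,
    "add/jl  (macro-fused)": 1,
}
total = sum(uops_per_iter.values())       # 7 fused-domain uops
PIPELINE_WIDTH = 4                        # uops issued per clock
front_end_cycles = total / PIPELINE_WIDTH
print(total, front_end_cycles)            # 7 uops -> 1.75 cycles/iter
```

At 1.75 cycles per iteration the front end sits close to the 2-cycle memory-port limit, so any extra overhead uops could make issue bandwidth the bottleneck instead.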

