Microarchitectural zeroing of a register via the register renamer: performance versus a mov?


Problem description


I read in a blog post that recent x86 microarchitectures are also able to handle common register-zeroing idioms (such as xor-ing a register with itself) in the register renamer; in the words of the author:

"the register renamer also knows how to execute these instructions – it can zero the registers itself."

Does anybody know how this works in practice? I know that some ISAs, like MIPS, contain an architectural register that is always set to zero in hardware; does this mean that the x86 microarchitecture has similar internal "zero" registers that architectural registers are mapped to when convenient? Or is my mental model of how this works microarchitecturally not quite correct?

I am asking because, from some observation, it seems that a mov from a register containing zero to a destination, in a loop, is still substantially faster than zeroing the register via xor within the loop.

Basically, what is happening is that I would like to zero a register within a loop depending on a condition. This can be done either by allocating an architectural register ahead of time to store zero (%xmm3, in this case), which is not modified for the entire duration of the loop, and executing the following within it:

movapd  %xmm3, %xmm0

or instead with the xor trick:

xorpd   %xmm0, %xmm0

(Both AT&T syntax).

In other words, the choice is between hoisting a constant zero outside of the loop and rematerializing it within the loop on each iteration. The latter reduces the number of live architectural registers by one, and, given the processor's supposed special-case recognition and handling of the xor idiom, it seems like it ought to be as fast as the former. These machines have more physical registers than architectural registers anyway, so the processor should be able to do internally the equivalent of what I have done in the assembly by hoisting out the constant zero, or even better, since it has full awareness of and control over its own resources. But it doesn't seem to be as fast, so I'm curious whether anyone with CPU architecture knowledge can explain if there's a good theoretical reason for that.

The registers in this case happen to be SSE registers and the machine happens to be Ivy Bridge; I'm not sure how important either of those factors is.

Solution

Executive summary: You can run up to four xor ax, ax instructions per cycle, whereas mov immediate, reg instructions are slower.

Details and references:

Wikipedia has a nice overview of register renaming in general: http://en.wikipedia.org/wiki/Register_renaming

Torbjörn Granlund's timings for instruction latencies and throughput for AMD and Intel x86 processors are at: http://gmplib.org/~tege/x86-timing.pdf

Agner Fog nicely covers the specifics in his Micro-architecture study:

8.8 Register allocation and renaming

Register renaming is controlled by the register alias table (RAT) and the reorder buffer (ROB) ... The µops from the decoders and the stack engine go to the RAT via a queue and then to the ROB-read and the reservation station. The RAT can handle 4 µops per clock cycle. The RAT can rename four registers per clock cycle, and it can even rename the same register four times in one clock cycle.
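To make the renaming step described above more concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the class name `RegisterAliasTable`, the `rename` method, and the integer physical-register ids are all hypothetical, and real hardware also manages a free list, retirement, and a per-cycle limit that are left out here. It shows only why renaming the same architectural register several times in a row is unproblematic: each write simply gets a fresh physical register.

```python
from itertools import count


class RegisterAliasTable:
    """Toy model (hypothetical): maps architectural names to physical ids."""

    def __init__(self, arch_regs):
        self._next_phys = count()
        # Every architectural register starts with its own physical register.
        self.mapping = {reg: next(self._next_phys) for reg in arch_regs}

    def rename(self, dest, sources):
        """Rename one µop: look up the sources, then remap the destination."""
        # Sources are read before the destination is remapped, so a µop
        # that reads and writes the same register sees the older value.
        src_phys = [self.mapping[s] for s in sources]
        dest_phys = next(self._next_phys)
        self.mapping[dest] = dest_phys
        return dest_phys, src_phys


rat = RegisterAliasTable(["eax", "ebx"])
# Four back-to-back writes to eax, each reading ebx.
results = [rat.rename("eax", ["ebx"]) for _ in range(4)]
for dest_phys, src_phys in results:
    print(f"eax -> p{dest_phys}, sources {src_phys}")
```

Each of the four writes lands in a different physical register, so later µops that read eax depend only on the most recent mapping.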

Special cases of independence

A common way of setting a register to zero is by XOR'ing it with itself or subtracting it from itself, e.g. XOR EAX,EAX. The Sandy Bridge processor recognizes that certain instructions are independent of the prior value of the register if the two operand registers are the same. This register is set to zero at the rename stage without using any execution unit. This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, XORPD, VXORPS, VXORPD and all variants of PSUBxxx and PCMPGTxx, but not PANDN etc.
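The rule quoted above can be written down as a small predicate. This is an illustrative sketch, not a description of the actual hardware logic: the function name `is_zeroing_idiom` and the way operands are passed (as the two source register names) are assumptions made for this example, and the mnemonic list is taken only from the passage above.

```python
# Mnemonics listed in the quoted passage as recognized zeroing idioms.
ZEROING_MNEMONICS = {"xor", "sub", "pxor", "xorps", "xorpd", "vxorps", "vxorpd"}
# "All variants of PSUBxxx and PCMPGTxx" are matched by prefix.
ZEROING_PREFIXES = ("psub", "pcmpgt")


def is_zeroing_idiom(mnemonic, sources):
    """Return True if the instruction matches the quoted zeroing rule.

    `sources` holds the two source register names (hypothetical interface).
    """
    m = mnemonic.lower()
    regs = [r.lower() for r in sources]
    same_register = len(regs) == 2 and regs[0] == regs[1]
    recognized = m in ZEROING_MNEMONICS or m.startswith(ZEROING_PREFIXES)
    return same_register and recognized


# The two alternatives from the question:
print(is_zeroing_idiom("xorpd", ("xmm0", "xmm0")))   # the xor trick
print(is_zeroing_idiom("movapd", ("xmm3", "xmm0")))  # copy from a zeroed register
```

Under this rule the `xorpd %xmm0, %xmm0` from the question qualifies, while the `movapd` copy and excluded instructions such as PANDN do not.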

Instructions that need no execution unit

The abovementioned special cases where registers are set to zero by instructions such as XOR EAX,EAX are handled at the register rename/allocate stage without using any execution unit. This makes the use of these zeroing instructions extremely efficient, with a throughput of four zeroing instructions per clock cycle.
