May there be any penalties when using 64/32-bit registers in Long mode?

Question

This is probably not even micro- but nano-optimization, but the subject interests me: are there any penalties when using non-native register sizes in long mode?

I've learned from various sources that partial register updates (like ax instead of eax) can cause an eflags stall and degrade performance. But I'm not sure about long mode. What register size is considered native for this processor operating mode? x86-64 is still an extension of the x86 architecture, thus I believe 32 bits are still native. Or am I wrong?

For example, instructions like

sub eax, r14d

sub rax, r14

have the same size, but may there be any penalties when using either of those? May there be any penalties when mixing register sizes in consecutive instructions like the below? (assuming high dword is zero in all cases)

sub ecx, eax
sub r14, rax

Recommended answer

May there be any penalties when mixing 32 and 64-bit register sizes in consecutive instructions?

No. Writing to a 32-bit register always zero-extends into the full 64-bit register, so x86-64 avoids any partial-register penalties for 32-bit and 64-bit instructions.
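
As a rough illustration of that rule, here is a minimal Python sketch that models only the architectural result of register writes, not timing. The helper names (`write32`, `write64`, `sub32`, `sub64`) and the register values are made up for this example; the two modelled instructions are the `sub ecx, eax` / `sub r14, rax` pair from the question.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def write64(old, value):
    """Model of a 64-bit register write: the old contents are fully replaced."""
    return value & MASK64

def write32(old, value):
    """Model of a 32-bit register write in long mode: the result is
    zero-extended to 64 bits, so it never depends on the old upper half."""
    return value & MASK32

def sub32(dst, src):
    """Hypothetical model of `sub r32, r32`."""
    return write32(dst, (dst - src) & MASK32)

def sub64(dst, src):
    """Hypothetical model of `sub r64, r64`."""
    return write64(dst, (dst - src) & MASK64)

# Illustrative starting values (assumptions, not from the original post).
rax = 5
rcx = 0xDEADBEEF_00000007   # deliberately stale upper half
r14 = 100

rcx = sub32(rcx, rax)   # sub ecx, eax : upper 32 bits of RCX become zero
r14 = sub64(r14, rax)   # sub r14, rax : ordinary 64-bit subtract
```

Because the 32-bit write replaces the whole register, there is nothing left over to merge later, which is why no partial-register penalty arises here.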

thus I believe 32 bits are still native.

Yes, the default operand-size is 32-bit for most instructions (other than PUSH/POP). 64-bit needs a REX prefix with the W bit set to 1. So prefer 32-bit for code-size reasons. This is why compilers use mov r32, imm32 for addresses of static data (since the default code-model requires that code and static data addresses are in the low 2GiB of virtual address space).
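
To make the code-size point concrete, the following Python sketch encodes the register-register form of `sub` using opcode `0x29` (`SUB r/m, r`). The function name `encode_sub_rr` and the register-number constants are made up for this example, and a real assembler may pick the equivalent `0x2B` form instead.

```python
# Register numbers as used in the ModRM byte and REX prefix (subset only).
RAX, RCX, R14 = 0, 1, 14

def encode_sub_rr(dst, src, wide):
    """Encode `sub dst, src` for two registers with opcode 0x29 (SUB r/m, r).

    dst, src: register numbers 0..15.
    wide:     True for 64-bit operand size (REX.W = 1), False for 32-bit.
    """
    rex = 0x40
    if wide:
        rex |= 0x08          # REX.W: 64-bit operand size
    if src >= 8:
        rex |= 0x04          # REX.R: extends the ModRM.reg field
    if dst >= 8:
        rex |= 0x01          # REX.B: extends the ModRM.rm field
    modrm = 0xC0 | ((src & 7) << 3) | (dst & 7)   # mod=11: register direct
    out = bytearray()
    if rex != 0x40:          # here a REX prefix is only needed for W or r8..r15
        out.append(rex)
    out += bytes([0x29, modrm])
    return bytes(out)

print(encode_sub_rr(RAX, R14, wide=False).hex())  # sub eax, r14d
print(encode_sub_rr(RAX, R14, wide=True).hex())   # sub rax, r14
print(encode_sub_rr(RCX, RAX, wide=False).hex())  # sub ecx, eax
print(encode_sub_rr(R14, RAX, wide=True).hex())   # sub r14, rax
```

The first two instructions are the same length because using r14 already requires a REX prefix, so setting the W bit costs nothing extra. `sub ecx, eax` needs no prefix at all and is one byte shorter than any 64-bit form.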

It was a design choice by AMD. They could have chosen the other way, and required a prefix to get 32-bit operand size. Since long mode is a separate mode, x86-64 machine code can be different from x86-32 machine code however it wants. AMD chose to minimize the differences so they could share as many transistors as possible in the decoders. Your conclusion is correct, but your reasoning is totally bogus.

partial register updates (like ax instead of eax) can cause eflags stall and degrade performance.

Partial-flag stalls are separate from partial-register stalls. They're handled similarly internally (the separately-renamed parts of EFLAGS have to be merged the same as a modified AX has to be merged with the unmodified upper bytes of EAX). But one doesn't cause the other.

# partial-reg stall
setcc   al           # leaves the upper 3 (or 7) bytes unmodified
add     edx, eax     # reads full EAX.  Older CPUs stall while merging

Zeroing EAX with xor eax,eax ahead of the flag-setting instruction and the setcc avoids the partial-register penalty entirely. (Core2/Nehalem stall for fewer cycles than earlier CPUs, but still stall for 2 or 3 cycles while inserting a merging uop. Sandybridge doesn't stall at all while inserting the merging uop.)
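
The merge that the hardware has to perform can be pictured with a small Python model of the architectural state. This is only a sketch of what the register holds, not of any pipeline; `write8_low`, `setcc_al`, and the stale value are made up for this example.

```python
MASK64 = (1 << 64) - 1

def write8_low(old, value):
    """Model of writing AL: only bits 0..7 change, bits 8..63 keep the old value.
    The dependence on `old` is what forces a merge (or a false dependency)."""
    return (old & ~0xFF & MASK64) | (value & 0xFF)

def setcc_al(rax, condition):
    """Hypothetical model of `setcc al`: stores 0 or 1 into the low byte."""
    return write8_low(rax, 1 if condition else 0)

stale = 0x1234_5678_9ABC_DEF0          # whatever RAX happened to hold before

# setcc al alone: a later full-register read sees the stale upper bytes.
without_zeroing = setcc_al(stale, True)

# xor eax,eax first (placed before the flag-setting instruction, since xor
# itself writes flags), then setcc al: the full register is a clean 0 or 1.
with_zeroing = setcc_al(0, True)

print(hex(without_zeroing), hex(with_zeroing))
```

In the zeroed case the old register contents no longer matter, so the full-width read that follows has nothing to wait for.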

(Another summary of partial register penalties on different CPUs: Why doesn't GCC use partial registers?, saying basically the same thing).

AMD doesn't suffer from partial-register stalls when reading the full register later, but instead partial-register writes and reads have a false dependency on the full register. (AMD CPUs don't rename sub-registers separately in the first place. Intel P4 and Silvermont / Knight's Landing are the same way.)

Intel Haswell/Skylake (and maybe Ivybridge) don't rename al separately from rax at all, so they never need to merge low8 / low16 registers. But the setcc al has a false dependency on the old value. They do still rename and merge ah. (Details on HSW/SKL partial-reg performance.)

# partial flag stall when reading a flag that didn't come from
# the last instruction to write any flags.
clc
# edi and esi = one-past-the-end of dst and src
# ecx = -count
bigInt_add:
    mov   eax, [esi+ecx*4]
    adc   [edi+ecx*4], eax   # reads CF, partial flag stall on 2nd and later iterations
    inc   ecx                # writes all flags except CF
    jl    bigInt_add         # loop upwards towards zero
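
The loop above only works because inc leaves CF untouched, so the carry produced by one adc survives until the next iteration. Here is a minimal Python sketch of the same loop, modelling CF as an explicit variable. The names `adc32` and `bigint_add` are made up for this example, and unlike the assembly (which always runs the body at least once) this version also tolerates an empty input.

```python
MASK32 = (1 << 32) - 1

def adc32(a, b, cf):
    """Model of a 32-bit add-with-carry: returns (result, new CF)."""
    total = a + b + cf
    return total & MASK32, total >> 32

def bigint_add(dst, src):
    """Add src into dst in place, least-significant 32-bit limb first.
    Returns the final carry out of the most-significant limb."""
    assert len(dst) == len(src)
    cf = 0                    # clc
    i = -len(dst)             # ecx = -count; negative index counts from the end
    while i < 0:              # jl: keep looping while the counter is negative
        dst[i], cf = adc32(dst[i], src[i], cf)   # adc [edi+ecx*4], eax
        i += 1                # inc ecx: changes the counter, cf is not touched
    return cf

limbs = [0xFFFFFFFF, 0x00000000]       # illustrative value: 2**32 - 1
carry_out = bigint_add(limbs, [1, 0])  # add 1
print([hex(x) for x in limbs], carry_out)
```

On pre-Sandybridge Intel CPUs the adc has to combine CF from the previous adc with the other flags written by inc, which is the partial-flag merge the comment in the assembly refers to.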

See this Q&A for more discussion about partial-flags issues on Intel pre-Sandybridge vs. Sandybridge.

See also Agner Fog's microarch pdf.
