Are RMW instructions considered harmful on modern x86?

Question

I recall that read-modify-write (RMW) instructions are generally to be avoided when optimizing x86 for speed. That is, you should avoid something like add [rsi], 10, which adds 10 to the memory location whose address is stored in rsi. The recommendation was usually to split it into a read-modify instruction followed by a store, so something like:

mov rax, 10
add rax, [rsp]
mov [rsp], rax

Alternatively, you might use explicit loads and stores and a reg-reg add operation:

mov rax, [rsp]
add rax, 10
mov [rsp], rax

Is this still reasonable advice (and was it ever?) for modern x86? [1]

Of course, in cases where a value from memory is used more than once, RMW is inappropriate, since you will incur redundant loads and stores. I'm interested in the case where a value is only used once.

Based on exploration in Godbolt, all of icc, clang and gcc prefer to use a single RMW instruction to compile something like:

void Foo::f() {
  x += 10;
}

into:

Foo::f():
    add     QWORD PTR [rdi], 10
    ret
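
For anyone who wants to reproduce this, here is a minimal self-contained sketch that can be pasted into Godbolt with optimization enabled (for example -O2). The definition of Foo is not shown in the question, so a single 64-bit member is assumed here to match the QWORD PTR operand; demo_foo is a hypothetical helper added only as a sanity check.

```cpp
#include <cassert>
#include <cstdint>

// Assumed definition: the question does not show Foo, so a single
// 64-bit member is used to match the QWORD PTR operand in the output.
struct Foo {
    std::int64_t x = 0;
    void f();
};

// The value is read, modified and written exactly once, which is the
// case the question is about.
void Foo::f() {
    x += 10;
}

// Hypothetical helper (not from the question): checks that one call
// to f() adds exactly 10 to the member.
inline std::int64_t demo_foo() {
    Foo foo;
    foo.x = 32;
    foo.f();
    assert(foo.x == 42);
    return foo.x;
}
```

Which instruction sequence you see for Foo::f depends on the compiler, its version and the flags, so treat the listing above as one observed result rather than a guarantee.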

So at least most compilers seem to think RMW is fine, when the value is only used once.

Interestingly enough, the various compilers do not agree when the incremented value is a global, rather than a member, such as:

int global;

void g() {
  global += 10;
}

In this case, gcc and clang still use a single RMW instruction, while icc prefers a reg-reg add with explicit loads and stores:

g():
        mov       eax, DWORD PTR global[rip]                    #5.3
        add       eax, 10                                       #5.3
        mov       DWORD PTR global[rip], eax                    #5.3
        ret     
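
To make the comparison concrete, here is a small sketch with both spellings written out in C++. The names g_split and demo_global are hypothetical and not part of the original test. Both functions have the same observable behaviour for a plain (non-volatile, non-atomic) global, so an optimizing compiler is free to emit either instruction sequence for either one; the difference seen above is an instruction-selection choice, not something the source dictates.

```cpp
#include <cassert>

int global = 0;  // same global as in the question

// Spelled as a single compound assignment.
void g() {
    global += 10;
}

// Spelled as an explicit load, register add and store.
// Hypothetical name, used only for this comparison.
void g_split() {
    int tmp = global;  // load
    tmp += 10;         // modify in a local
    global = tmp;      // store
}

// Hypothetical helper: runs both variants once, starting from zero.
int demo_global() {
    global = 0;
    g();
    g_split();
    assert(global == 20);
    return global;
}
```

Comparing the generated code for g and g_split in the same compiler is a quick way to see whether the source-level split survives optimization at all.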

Perhaps it is something to do with RIP relative addressing and micro-fusion limitations? However, icc13 still does the same thing with -m32 so perhaps it's more to do with the addressing mode requiring a 32-bit displacement.

[1] I'm using the deliberately vague term "modern x86" to basically mean the last few generations of Intel and AMD laptop/desktop/server chips.

Recommended answer

Are RMW instructions considered harmful on modern x86?

No.

On modern x86/x64 the input instructions are translated into uops.
Any RMW instruction will be broken down into a number of uops; in fact into the same uops that separate instructions would be broken down into.

By using a 'complex' RMW instruction instead of separate 'simple' read, modify and write instructions, you gain the following:

  1. Fewer instructions to decode.
  2. Better use of the instruction cache.
  3. Better use of the addressable registers.

You can see this quite clearly in Agner Fog's instruction tables.

ADD [mem],const has a latency of 5 cycles.

MOV [mem],reg and vice versa have a latency of 2 cycles each, and ADD reg,const has a latency of 1 cycle, for a total of 5.

I checked the timings for Intel Skylake, but AMD K10 is the same.

You need to take into account that compilers have to cater to many different processors and some compilers even use the same core logic for different processor families. This can lead to quite suboptimal strategies.

RIP-relative addressing
On x64, RIP-relative addressing takes an extra cycle to resolve RIP on older processors.
Skylake does not have this delay, and I'm sure others will eliminate the delay as well.
I'm sure you're aware that 32-bit x86 does not support EIP-relative addressing; there you have to do this in a roundabout fashion.
