How can memory destination BTS be significantly slower than load / BTS reg,reg / store?


Problem description

In the general case, how can an instruction that accepts either memory or register operands ever be slower with a memory operand than the equivalent mov + mov -> instruction -> mov + mov sequence?

Based on the throughput and latency figures in Agner Fog's instruction tables (Skylake in my case, p. 238), I see the following numbers for the btr/bts instructions:

instruction  operands  uops fused domain  uops unfused domain  latency  throughput
mov          r,r       1                  1                    0-1      .25
mov          m,r       1                  2                    2        1
mov          r,m       1                  1                    2        .5
... 
bts/btr      r,r       1                  1                    N/A      .5
bts/btr      m,r       10                 10                   N/A      5

I don't see how these numbers could possibly be correct. Even in the worst case, where there are no registers to spare and you have to save one to a temporary memory location, it would be faster to do this:

## hypothetical worst-case microcode that saves/restores a scratch register
mov m,r  // + 1  throughput , save a register
mov r,m  // + .5 throughput , load BTS destination operand
bts r,r  // + 1  throughput , do bts (or btr)
mov m,r  // + 1  throughput , store result
mov r,m  // + .5 throughput , restore register

Even as the worst case, this has better throughput than a plain bts m,r (4 < 5). (Editor's note: adding up throughputs doesn't work when the instructions have different bottlenecks. You need to consider uops and ports; this sequence should have 2c throughput, bottlenecked on the 1/clock store throughput.)

And microcode instructions have their own set of registers, so it seems extremely unlikely that this would actually be needed. Can anyone explain why bts (or, in general, any instruction) could be slower with memory, register operands than the worst-case move sequence above?

(Editor's note: yes, there are a few hidden temporary registers that microcode can use. Something like add [mem], reg does, at least logically, just load into one of those and then store the result.)

Recommended answer

What you're missing is that BT, BTC, BTS and BTR don't work the way you described when a memory operand is used. You're assuming the memory versions work the same as the register versions, but that's not quite the case. With the register version, the value of the second operand is taken modulo 64 (or 16 or 32). With the memory version, the value of the second operand is used as is. This means that the actual memory location accessed by the instruction may not be the address given by the memory operand, but one somewhere past it.

For example, ignoring atomicity and the need to save registers, to get the same effect as BTS [rsi + rdi], rax using the register version of BTS, you'd need to do something like this:

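LEA rbx, [rsi + rdi]      ; address named by the memory operand
MOV rcx, rax              ; copy the bit offset
SAR rcx, 6                ; signed qword index = floor(bit offset / 64)
MOV rdx, [rbx + rcx*8]    ; load the qword that actually holds the bit
BTS rdx, rax              ; register form uses rax modulo 64
MOV [rbx + rcx*8], rdx    ; store the updated qword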

You can simplify this if you know the value of RAX is less than 64, or if the memory operand is simpler. Indeed, as you've noticed, it may be an advantage in cases like these to use the faster register version over the slower memory version, even if it means a few more instructions.
