Can a processor do memory and arithmetic operations at the same time?


Question


While studying assembly and processors, one thing puzzles me: how is this instruction carried out?

add mem, 1

In my head, the processor cannot load the memory value and do the arithmetic operation within the same instruction. So I figure it happens like this:

mov reg, mem
add reg, 1
mov mem, reg

If I consider a processor with a RISC pipeline, we can observe some stalls, which is surprising for an instruction as simple as i++:

|  Fetch  | Decode  | Exec    | Memory  | WriteB  |
          |  Fetch  |         |         | Decode  | Exec    | Memory  | WriteB  |
                    |  Fetch  |         |         |         | Decode  | Exec    | Memory  | WriteB  |

(As I read in Hennessy and Patterson's book Computer Architecture: A Quantitative Approach, registers are read in the Decode uOp, Store/Load happens in the Memory uOp, and we allow ourselves to take the value of a register at the Memory uOp.)

Am I right? Or do modern processors have specific methods to do this more efficiently?

Solution

You're right, a modern x86 will decode add dword [mem], 1 to 3 uops: a load, an ALU add, and a store.

Those 3 dependent operations can't happen at the same time because the later ones have to wait for the result of the earlier one.

But execution of independent instructions can overlap, and modern CPUs very aggressively look for and exploit "instruction level parallelism" to run your code faster than 1 uop per clock. See this answer for an intro to what a single CPU core can do in parallel, with links to more stuff, like Agner Fog's x86 microarch guide, and David Kanter's write-ups of Sandybridge and Bulldozer.
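As a minimal sketch of that overlap (my own hypothetical example, not from the original answer): two independent dependency chains can have uops executing in the same cycles on different execution ports, while uops within each chain still wait for their inputs:

add eax, 1      ; chain 1
add ebx, 1      ; chain 2: independent, can execute in the same cycle as the add above
add eax, ecx    ; chain 1: must wait for the first add's result
add ebx, edx    ; chain 2: must wait for the second add's result, but not for chain 1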


But if you look at Intel's P6 and Sandybridge microarchitecture families, a store is actually separate store-address and store-data uops. The store-address uop has no dependency on the load or ALU uop, and can write the store address into the store buffer at any time. (Intel's optimization manual calls it the Memory Order Buffer).

To increase front-end throughput, store-address and store-data uops can decode as a micro-fused pair. For add, so can the load+alu operation, so an Intel CPU can decode add dword [rdi], 1 to 2 fused-domain uops. (The same load+add micro-fusion works for decoding add eax, [rdi] to a single uop, so any of the "simple" decoders can decode it, not just the "complex" decoder that can handle multi-uop instructions. This reduces front-end bottlenecks.)
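To illustrate (a sketch; exact uop counts vary by microarchitecture, see Agner Fog's instruction tables for a given CPU), the register-destination and memory-destination forms decode differently:

add eax, [rdi]        ; 1 fused-domain uop: micro-fused load+add, any decoder slot can take it
add dword [rdi], 1    ; 2 fused-domain uops: micro-fused load+add, micro-fused store-address+store-data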

This is why add [mem], 1 is more efficient than inc [mem] on Intel CPUs, even though inc reg is just as efficient as (and smaller than) add reg, 1. (inc can't micro-fuse its load+inc; inc sets flags differently than add.) See INC instruction vs ADD 1: Does it matter?
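A side-by-side sketch of that difference (counts hedged as described above; verify against Agner Fog's tables for your CPU):

inc dword [rdi]       ; load can't micro-fuse with inc: more fused-domain uops
add dword [rdi], 1    ; load+add micro-fuses, store fuses: fewer fused-domain uops, better front-end throughput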

But this is just helping the front-end get uops into the scheduler more quickly; the load still has to run separately from the add.

But a micro-fused load doesn't have to wait for the rest of the whole instruction's inputs to be ready. Consider an instruction like add [rdi], eax where RDI and EAX are both inputs to the instruction, but EAX isn't needed until the ALU add uop. The load can execute as soon as the load-address is ready and there's a free load execution unit (AGU + cache access). See also How are x86 uops scheduled, exactly?.
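A hypothetical timing illustration: the load uop of a micro-fused pair only needs the address register, so it can dispatch before the data input is ready:

imul eax, ecx        ; slow producer of EAX (3-cycle latency)
add  [rdi], eax      ; the load uop can execute as soon as RDI is ready,
                     ; in parallel with the imul; only the ALU-add uop waits for EAX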


"registers are read in Decode uOp, Store/Load in Memory uOp and we allow ourselves to take the value of a register at the Memory uOp"

All current x86 microarchitectures use out-of-order execution with register renaming (Tomasulo's algorithm). Instructions are renamed and issued into the out-of-order part of the core (ROB and scheduler).
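For example (a hypothetical illustration of renaming, not from the original answer; the physical-register names are made up): reusing an architectural register creates no false dependency, because each write is renamed onto a fresh physical register:

mov eax, [rdi]       ; EAX renamed to physical reg p1
add ebx, eax         ; reads p1
mov eax, [rsi]       ; EAX renamed to p2: no WAR/WAW hazard, doesn't wait for p1's consumers
add ecx, eax         ; reads p2; this pair can overlap execution with the pair above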

The physical register file isn't read until an instruction is "dispatched" from the scheduler to an execution unit. (Or for recently-generated inputs, forwarded from other uops.)


Independent instructions can overlap their execution. For example, a Skylake CPU can sustain a throughput of 4 fused-domain / 7 unfused-domain uops per clock, including 2 loads + 1 store, in a carefully crafted loop:

.loop: ; HSW: 1.12c / iter. SKL: 1.0001c
    add edx, [rsp]           ; 1 fused-domain uop:  micro-fused load+add
    mov [rax], edi           ; 1 fused-domain uop:  micro-fused store-address+store-data
    blsi ebx, [rdi]          ; 1 fused-domain uop:  micro-fused load+bit-manip

    dec ecx
    jnz .loop                ; 1 fused-domain uop: macro-fused dec+branch runs on port 6

Sandybridge-family CPUs have an L1d cache capable of 2 reads + 1 write per clock. (Before Haswell, only 256-bit vectors could work around the AGU throughput limit, though. See How can cache be that fast?.)

Sandybridge-family front-end throughput is 4 fused-domain uops per clock, and they have lots of execution units in the back-end to handle various instruction mixes. (Haswell and later have 4 integer ALUs, 2 load ports, a store-data port, and a dedicated store-AGU for simple store addressing modes. So they can often "catch up" quickly after a cache-miss stalls execution, quickly making room in the out-of-order window to find more work to do.)
