How should I approach finding the number of pipeline stages in my laptop's CPU?

Problem description

I want to look into how the latest processors differ from a standard RISC-V implementation (RISC-V has a classic 5-stage pipeline: fetch, decode, execute/ALU, memory access, write-back), but I'm not able to find how I should start approaching the problem of finding the current implementation of pipelining in a processor.

I tried referring to Intel's documentation for the i7-4510U, but it was not much help.

Answer

Haswell's pipeline length is reportedly 14 stages (on a uop-cache hit), or 19 stages when fetching from L1i for legacy decode. The only viable approach for finding it is to look it up in articles about that microarchitecture; you can't measure it exactly.

A lot of what we know about Intel and AMD CPU internals is based on presentations at chip conferences by the vendors, their optimization manuals, and their patents. You can't truly measure it with a benchmark, but it's related to the branch mispredict penalty. Note that pipelined execution units each have their own pipelines, and the memory pipeline is also kinda separate.

Your CPU's cores are Intel's Haswell microarchitecture. See David Kanter's deep dive on its internals: https://www.realworldtech.com/haswell-cpu/.

It's a superscalar out-of-order exec design, not a simple in-order design like the classic RISC you're thinking of. Required background reading: Modern Microprocessors: A 90-Minute Guide! It covers the evolution of CPU architecture from simple non-pipelined designs to pipelined, superscalar, and out-of-order execution.

It has sizeable buffers between some pipeline stages, not just a simple latch; its branch prediction works so well that it's usually more useful for it to hide fetch bubbles by buffering multiple bytes of machine code. With no stalls anywhere, the issue/rename stage is the narrowest point in the pipeline, so front-end buffers between stages will tend to fill up. (In Haswell, uop-cache fetch is reportedly only 4 uops per clock, too. Skylake widened that to 6, up to a whole uop cache line read into the IDQ.)
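
The fetch-wider-than-issue arrangement described above can be sketched with a toy queue simulation. This is purely illustrative: the 6-wide fetch and 4-wide issue are the Skylake-like figures from the text, but the 64-entry IDQ size and the per-cycle ordering are modeling assumptions, not exact hardware behavior.

```python
# Toy model of a decoupled front-end: fetch can deliver more uops per clock
# than issue/rename can consume, with a queue (IDQ) between them.
# Widths and queue size are illustrative assumptions (Skylake-like:
# 6-wide uop-cache fetch, 4-wide issue, 64-entry IDQ).

def simulate(cycles, fetch_width=6, issue_width=4, idq_size=64):
    occupancy = 0
    issued_total = 0
    for _ in range(cycles):
        # Issue/rename drains up to issue_width uops from the queue.
        issued = min(issue_width, occupancy)
        occupancy -= issued
        issued_total += issued
        # Fetch refills up to fetch_width, limited by remaining queue space.
        occupancy += min(fetch_width, idq_size - occupancy)
    return occupancy, issued_total

occ, issued = simulate(100)
print("final IDQ occupancy:", occ)           # queue fills: fetch outruns issue
print("uops issued in 100 cycles:", issued)  # bounded by issue width
```

With fetch at 6/clock and issue at 4/clock, occupancy climbs by 2 per cycle until the queue is full; after that, fetch is throttled to the drain rate. That is the sense in which the issue/rename stage is the narrowest point and the front-end buffers tend to fill.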

https://en.wikichip.org/wiki/intel/microarchitectures/haswell_(client) reports the pipeline length as "14-19" stages, which counts from uop-cache fetch or from L1i cache fetch. (Uop cache hits shorten the effective length of the pipeline, cutting out decode.) https://www.anandtech.com/show/6355/intels-haswell-architecture/6 says the same thing.

Also https://www.7-cpu.com/cpu/Haswell.html measured the mispredict penalty at 15.0 cycles for a uop-cache hit, 18-20 cycles for a uop-cache miss (L1i cache hit). That's correlated with the length of part of the pipeline.

Note that the actual execution units in the back-end each have their own pipeline, e.g. the AVX FMA units on ports 0 and 1 are each 5 stages long. (vmulps / vfma...ps latency of 5 cycles on Haswell.) I don't know whether that 14-19 cycle length of the whole pipeline counts execution as 1 cycle, because typical integer ALU instructions like add have only 1 cycle latency. (And 4/clock throughput.) Slower integer ALU instructions like imul, popcnt, and bsf can only execute on port 1, where they have 3 cycle latency.
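
The latency vs. throughput distinction for a pipelined execution unit can be made concrete with the standard pipelining arithmetic (a sketch; the 3-cycle imul latency is the figure from the paragraph above, and the model ignores front-end and scheduling limits):

```python
# Cycle counts for n operations on a fully pipelined unit with latency L
# cycles and a throughput of one op per clock.

def dependent_chain(n, latency):
    # Each op waits for the previous result: no overlap at all,
    # so the chain is latency-bound.
    return n * latency

def independent_ops(n, latency):
    # One op starts per clock; the last one finishes `latency` cycles
    # after it starts: classic pipelined (throughput-bound) timing.
    return (n - 1) + latency

IMUL_LATENCY = 3  # cycles on Haswell (port 1), per the text above
print(dependent_chain(12, IMUL_LATENCY))   # 36 cycles, latency-bound
print(independent_ops(12, IMUL_LATENCY))   # 14 cycles, throughput-bound
```

The same arithmetic applies to the 5-stage FMA units: a chain of dependent vmulps runs at 5 cycles per multiply, while independent multiplies can start every clock.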

The store buffer also entirely decouples store commit to L1d cache from execution of store instructions. This can have an impact on interrupt latency if the store buffer is full of a bunch of retired cache-miss stores. Once retired from the ROB, they can't be discarded and definitely have to happen, so they'll block any store done by the interrupt handler from committing until they drain. Or block any serializing instruction (including iret) from retiring; x86 "serializing" instructions are defined as emptying the whole pipeline.

Haswell's store buffer is 42 entries large, and can commit to L1d cache at 1/clock assuming no cache misses, or take many more cycles per store with cache misses. Of course, the store buffer isn't a "pipeline"; physically it's likely a circular buffer that's read by some logic that tries to commit the head to L1d cache. This logic is fully separate from the store execution units (which write the address and data into the store buffer). So the size of the store buffer affects how long it can take to drain "the pipeline" in a general sense, but in terms of a pipeline of connected stages from fetch to retirement, it isn't really that.
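
A back-of-the-envelope model of that drain time (the 42-entry size and 1/clock commit are the Haswell figures above; the 200-cycle miss cost is an assumed round-trip-to-memory number for illustration, not a measured one):

```python
# Rough model of store-buffer drain time: hits commit at 1/clock; a store
# that misses L1d stalls commit for roughly the miss latency.
# 42 entries and 1/clock commit: Haswell figures from the text.
# 200-cycle miss cost: assumed, for illustration only.

def drain_cycles(stores, miss_latency=200):
    # `stores` is a list of booleans: True = L1d hit, False = cache miss.
    return sum(1 if hit else miss_latency for hit in stores)

all_hits  = [True] * 42
some_miss = [True] * 38 + [False] * 4
print(drain_cycles(all_hits))   # 42 cycles: one commit per clock
print(drain_cycles(some_miss))  # a few misses dominate the drain time
```

This is why a store buffer full of retired cache-miss stores is the interesting case for interrupt latency: a handful of misses can stretch the drain far beyond the buffer's size in cycles.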

Even the out-of-order execution back-end can have a very long dependency chain in flight that would take a long time to wait for, e.g. a chain of sqrtsd instructions might be the slowest thing you could queue up (max latency per uop), as in this Meltdown exploit example that needs to create a long shadow for speculative execution after a fault. So the time to drain the back-end can be much longer than the "pipeline length". (But unlike the store buffer, these uops can simply be discarded on an interrupt, rolling back to the consistent retirement state.)

(Also related to long dep chains: Are loads and stores the only instructions that gets reordered? and Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths)

Pipeline length is not really directly meaningful. The performance-relevant characteristic that's related to pipeline length is the branch mispredict penalty. See What exactly happens when a skylake CPU mispredicts a branch?. (And I guess also part of the I-cache miss penalty; how long after data arrives from off-core can the back end start executing anything.) Thanks to out-of-order execution and fast recovery, the branch misprediction penalty can sometimes be partly overlapped with slow "real work" in the back-end. See also: Avoid stalling pipeline by calculating conditional early.

What people generally try to actually measure is branch mispredict penalty. If you're curious, https://www.7-cpu.com/ is open-source. You could have a look at their code for testing.
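
The usual methodology is differential: time a loop whose branch pattern is perfectly predictable, time the same loop with a random pattern, and divide the extra time by the number of mispredicts. A sketch of that arithmetic as a model (the 1-cycle base cost per branch and the ~50% mispredict rate for random data are assumptions; 15 cycles is the Haswell uop-cache-hit penalty quoted above):

```python
# Differential measurement model:
#   total_cycles = n_branches * base + mispredicts * penalty
# so the penalty falls out as (t_random - t_predictable) / mispredict_count.
# base=1 cycle/branch and the 50% mispredict rate are assumed figures;
# penalty=15 is the Haswell uop-cache-hit number from 7-cpu.com.

def total_cycles(n_branches, mispredicts, base=1, penalty=15):
    return n_branches * base + mispredicts * penalty

n = 1_000_000
t_pred = total_cycles(n, mispredicts=0)       # pattern fully predicted
t_rand = total_cycles(n, mispredicts=n // 2)  # random taken/not-taken

estimated_penalty = (t_rand - t_pred) / (n // 2)
print(estimated_penalty)  # recovers the modeled 15-cycle penalty
```

A real test does the same subtraction on measured cycle counts (e.g. from performance counters), which is why the result comes out as a per-mispredict figure like 15.0 cycles rather than a stage count.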

lfence to drain the out-of-order back-end has unknown amounts of overhead beyond just the length of the pipeline, so you can't just use that. You could make a big block of just back-to-back lfence to measure lfence throughput, but with nothing between lfences we get about 1 per 4.0 cycles; I guess because it doesn't have to serialize the front-end which is already in-order. https://www.uops.info/table.html.

And rdtsc itself is pretty slow, which makes writing microbenchmarks an extra challenge. Often you have to put stuff in a loop or unrolled block and run it many times so timing overhead becomes negligible.
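
The amortization argument can be put in numbers (the 30-cycle overhead is an assumed rdtsc-style figure for illustration; the point is only how the error shrinks with repetition count):

```python
# With per-measurement overhead O and true cost c per operation, one timed
# run of n operations reads O + n*c, so the per-op estimate is off by O/n.
# overhead=30 cycles is an assumed rdtsc-style figure, not a measured one.

def measured_per_op(n, true_cost=1.0, overhead=30.0):
    return (overhead + n * true_cost) / n

for n in (1, 10, 100, 10_000):
    est = measured_per_op(n)
    print(f"n={n:6d}: estimate {est:.4f} cycles/op (error {est - 1.0:.4f})")
```

At n=1 the overhead swamps the measurement (31x too high in this model); by n=10,000 the error is down in the third decimal place, which is why microbenchmarks loop or unroll the operation under test.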

The standard RISC-V implementations include an unpipelined core, and 2, 3, and 5-stage pipelined cores, and an out-of-order implementation. (https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf).

It doesn't have to be implemented as a classic 5-stage RISC, although that would make it very much like classic MIPS and would be normal for teaching CPU-architecture and pipelining.

Note that the classic-RISC pipeline (with 1 mem stage, and address calculation done in EX) requires an L1d access latency of 1 cycle, so that's not a great fit for modern high-performance designs with high clocks and large caches. e.g. Haswell has L1d load latency of 4 or 5 cycles. (See Is there a penalty when base+offset is in a different page than the base? for more about the 4-cycle special case shortcut where it guesses the final address to start TLB lookup in parallel with address-generation.)
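
A sketch of why the 1-cycle MEM stage matters, using textbook timing for a load followed by a dependent instruction (this models the classic 5-stage IF/ID/EX/MEM/WB with forwarding from the end of MEM into EX; it is a teaching model, not any real core):

```python
# Load-use hazard in a classic 5-stage pipeline (IF ID EX MEM WB), with
# MEM taking `l1d_latency` cycles and forwarding from MEM output to EX.

def load_use_bubbles(l1d_latency=1):
    load_ex = 3                            # load: IF=1, ID=2, EX=3
    load_value_ready = load_ex + l1d_latency  # value exists after MEM
    dep_ex_earliest = 4                    # dependent op issued back-to-back
    dep_ex = max(dep_ex_earliest, load_value_ready + 1)  # wait for forward
    return dep_ex - dep_ex_earliest        # stall cycles inserted

print(load_use_bubbles(l1d_latency=1))  # 1 bubble: the textbook hazard
print(load_use_bubbles(l1d_latency=4))  # 4 bubbles with Haswell-like L1d
```

With a modern 4-cycle L1d, an in-order classic-RISC structure would stall 4 cycles on every load-use pair, which is part of why deeply pipelined caches are paired with out-of-order execution that can fill those bubbles with independent work.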
