Multiple accesses to main memory and out-of-order execution

Question

Let us assume that I have two pointers that are pointing to unrelated addresses that are not cached, so they will both have to come all the way from main memory when being dereferenced.

int load_and_add(int *pA, int *pB)
{
    int a = *pA;   // will most likely miss in cache
    int b = *pB;   // will most likely miss in cache 

    // ...  some code that does not use a or b

    int c = a + b;
    return c;
}

If out-of-order execution allows executing the code before the value of c is computed, how will the fetching of values a and b proceed on a modern Intel processor?

Are the potentially-pipelined memory accesses completely serialized or may there be some sort of fetch overlapping performed by the CPU's memory controller?

In other words, if we assume that hitting main memory costs 300 cycles, will fetching a and b cost 600 cycles, or does out-of-order execution enable some overlap, perhaps costing fewer cycles?

Solution

Modern CPUs have multiple load buffers, so multiple loads can be outstanding at the same time. The memory subsystem is heavily pipelined, giving many parts of it much better throughput than latency. (e.g. with prefetching, Haswell can sustain an 8B load from main memory every clock, but the latency when the address isn't known ahead of time is in the hundreds of cycles.)
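(A rough sanity check, not from the answer itself: by Little's law, sustaining 8B per clock against a ~300-cycle miss latency means keeping about 300 * 8B = 2400B in flight at once, i.e. roughly 37 64-byte cache lines, which is why a core needs substantial miss-tracking hardware to get anywhere near its bandwidth limit.)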

So yes, a Haswell core can keep track of up to 72 outstanding load uops waiting for data from cache / memory. (This is per-core. The shared L3 cache also needs some buffers to handle the whole system's loads / stores to DRAM and memory-mapped IO.)
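To see this memory-level parallelism directly, here is a minimal C sketch, not from the original answer: the array size, the Sattolo shuffle, and the clock_gettime() timing are my own illustrative choices. It contrasts a dependent pointer chase, where each load's address comes from the previous load so the misses serialize, with a pass over the same array whose addresses are known up front, so many misses can be in flight at once:

/*
 * Sketch: serialized misses vs. overlapped misses.
 * Build with e.g.:  cc -O2 mlp.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)   /* 16M unsigned ints = 64 MiB, well beyond L3 */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    unsigned *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: one random cycle over all N slots, so each
       load's address is the result of the previous load. */
    for (unsigned i = 0; i < N; i++) next[i] = i;
    srand(1);
    for (unsigned i = N - 1; i > 0; i--) {
        unsigned j = rand() % i;               /* j in [0, i) */
        unsigned tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    /* Dependent chain: the core can't start miss i+1 until miss i
       returns, so this pays roughly full memory latency per load. */
    double t0 = now();
    unsigned p = 0;
    for (unsigned i = 0; i < N; i++) p = next[p];
    double chase = now() - t0;

    /* Addresses known up front: many misses in flight at once (plus
       hardware prefetch), so throughput, not latency, sets the pace. */
    t0 = now();
    unsigned long long sum = 0;
    for (unsigned i = 0; i < N; i++) sum += next[i];
    double stream = now() - t0;

    printf("dependent chase: %.3fs  independent sum: %.3fs  (p=%u sum=%llu)\n",
           chase, stream, p, sum);
    free(next);
    return 0;
}

On typical hardware the dependent chase runs several times slower than the independent pass even though both execute exactly N loads; the gap is the latency-versus-throughput difference described above.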

Haswell's ReOrder Buffer size is 192 uops, so up to 190 uops of work in the code that does not use a or b can be issued and executed while the loads of a and b are the oldest instructions that haven't retired. Instructions / uops are retired in-order to support precise exceptions. The ROB size is basically the limit of the out-of-order window for hiding latency of slow operations like cache-misses.
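As a sketch of how that window gets used (my own illustration, not code from the question; the function name and the filler loop are hypothetical): if the code between the loads is register-only work with no dependence on a or b, the core dispatches both misses back to back and then executes the filler while waiting, so the final add waits only for the slower of the two misses. Once the filler exceeds the roughly 190-uop budget above, issue stalls until the oldest load completes and retires.

int load_add_with_work(int *pA, int *pB, int n)
{
    int a = *pA;          /* miss: parked in a load buffer, waiting on DRAM */
    int b = *pB;          /* independent miss: issued right behind the first,
                             so the two long waits overlap */

    int t = 0;
    for (int i = 0; i < n; i++)
        t = t * 3 + i;    /* ALU-only filler: no dependence on a or b, so it
                             issues and executes while both loads are in flight */

    int c = a + b;        /* only this add must wait for the load data */
    return c + t;
}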

Also see other links at the tag wiki to learn how CPUs work. Agner Fog's microarch guide is great for having a mental model of the CPU pipeline to let you understand approximately how code will execute.

From David Kanter's Haswell writeup: [diagram omitted]
