L2 instruction fetch misses much higher than L1 instruction fetch misses

Problem Description

I am generating a synthetic C benchmark aimed at causing a large number of instruction fetch misses via the following Python script:

#!/usr/bin/env python
import tempfile
import random
import sys

if __name__ == '__main__':
    functions = list()

    # Emit 10,000 dummy functions with random names (uses a private
    # tempfile helper purely as a convenient random-name generator).
    for i in range(10000):
        func_name = "f_{}".format(next(tempfile._get_candidate_names()))
        sys.stdout.write("void {}() {{\n".format(func_name))
        sys.stdout.write("    double pi = 3.14, r = 50, h = 100, e = 2.7, res;\n")
        sys.stdout.write("    res = pi*r*r*h;\n")
        sys.stdout.write("    res = res/(e*e);\n")
        sys.stdout.write("}\n")
        functions.append(func_name)

    # Emit main(): a loop that calls the functions in random order.
    sys.stdout.write("int main() {\n")
    sys.stdout.write("    unsigned int i;\n")
    sys.stdout.write("    for (i = 0; i < 100000; i++) {\n")
    for i in range(10000):
        r = random.randint(0, len(functions)-1)
        sys.stdout.write("        {}();\n".format(functions[r]))

    sys.stdout.write("    }\n")
    sys.stdout.write("}\n")

What the code does is simply generating a C program that consists of a lot of randomly named dummy functions that are in turn called in random order in main(). I am compiling the resulting code with gcc 4.8.5 under CentOS 7 with -O0. The code is running on a dual socket machine fitted with 2x Intel Xeon E5-2630v3 (Haswell architecture).

What I am interested in is understanding instruction-related counters reported by perf when profiling the binary compiled from the C code (not the Python script, that is only used to automatically generate the code). In particular, I am observing the following counters with perf stat:

  • instructions
  • L1-icache-load-misses (instruction fetches that miss L1, aka r0280 on Haswell)
  • r2424, L2_RQSTS.CODE_RD_MISS (instruction fetches that miss L2)
  • rf824, L2_RQSTS.ALL_PF (all L2 hardware prefetcher requests, both code and data)
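The raw event codes above follow perf's `rUUEE` encoding for Intel PMU events: the umask in the high byte and the event select in the low byte. On Haswell, L2_RQSTS is event select 0x24, with umask 0x24 for CODE_RD_MISS and 0xf8 for ALL_PF. A minimal sketch of the encoding (the helper name is mine, not a perf API):

```python
def perf_raw_code(event_select, umask):
    """Build perf's raw event string (rUUEE): umask byte, then event select byte."""
    return "r{:02x}{:02x}".format(umask, event_select)

# Haswell L2_RQSTS is event select 0x24; umasks are from the SDM event tables.
print(perf_raw_code(0x24, 0x24))  # L2_RQSTS.CODE_RD_MISS -> r2424
print(perf_raw_code(0x24, 0xf8))  # L2_RQSTS.ALL_PF       -> rf824
```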

I first profiled the code with all hardware prefetchers disabled in the BIOS, i.e.

  • MLC Streamer disabled
  • MLC Spatial Prefetcher disabled
  • DCU Data Prefetcher disabled
  • DCU Instruction Prefetcher disabled

and the results are the following (process is pinned to first core of second CPU and corresponding NUMA domain, but I guess this doesn't make much difference):

perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code   

 Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':    

    25,108,610,204      instructions                                               
     2,613,075,664      L1-icache-load-misses                                       
     5,065,167,059      r2424                                                       
                17      rf824                                                       

      33.696954142 seconds time elapsed 

Considering the figures above, I cannot explain such a high number of instruction fetch misses in L2. I have disabled all prefetchers, and L2_RQSTS.ALL_PF confirms so. But why do I see twice as many instruction fetch misses in L2 as in L1i? In my (simple) mental model of the processor, if an instruction is looked up in L2, it must necessarily have been looked up in L1i first. Clearly I am wrong; what am I missing?

I then tried to run the same code with all the hardware prefetchers enabled, i.e.

  • MLC Streamer enabled
  • MLC Spatial Prefetcher enabled
  • DCU Data Prefetcher enabled
  • DCU Instruction Prefetcher enabled

and the results are the following:

perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code

 Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':    

    25,109,877,626      instructions                                               
     2,599,883,072      L1-icache-load-misses                                       
     5,054,883,231      r2424                                                       
           908,494      rf824

Now, L2_RQSTS.ALL_PF seems to indicate that something more is happening, and although I expected the prefetchers to be a bit more aggressive, I imagine that the instruction prefetcher is severely put to the test by this jump-intensive kind of workload, while the data prefetcher has not much to do with it. But again, L2_RQSTS.CODE_RD_MISS is still too high with the prefetchers enabled.

So, to sum up, my question is:

With hardware prefetchers disabled, L2_RQSTS.CODE_RD_MISS seems to be much higher than L1-icache-load-misses. Even with hardware prefetchers enabled, I still cannot explain it. What is the reason behind such a high count of L2_RQSTS.CODE_RD_MISS compared to L1-icache-load-misses?

Answer

The instruction prefetcher can generate requests that don't count as accesses to the L1I cache but are counted as code fetch requests at higher-numbered memory levels, such as the L2. This is generally true on all Intel microarchitectures with an instruction prefetcher. L2_RQSTS.CODE_RD_MISS counts both demand and prefetch code fetch requests from the L1I. Demand requests are generated by a multiplexing unit in the IFU that chooses a target fetch linear address from among the different units in the pipeline that may change the flow, such as the branch prediction unit. Prefetch requests are generated by the L1I instruction prefetcher on an L1I miss, if possible.
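As a back-of-the-envelope accounting model (my simplification, not a documented mechanism): if every L1I demand miss sends one demand code read to the L2 and the L1I prefetcher issues up to one additional code read per miss, the L2 can see up to twice as many code fetch requests as there are L1I misses:

```python
def l2_code_fetches(l1i_misses, prefetches_per_miss=1.0):
    """Toy model: each L1I miss -> one demand L2 code read, plus
    prefetches_per_miss prefetch code reads issued alongside it."""
    return l1i_misses * (1 + prefetches_per_miss)

l1i_misses = 2_613_075_664          # measured L1-icache-load-misses
print(l2_code_fetches(l1i_misses))  # upper bound just above the measured r2424
```

With one prefetch per miss the model gives about 5.23 billion L2 code fetches, slightly above the 5.07 billion CODE_RD_MISS actually measured, consistent with the "within 2x" relationship described below.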

In general, the number of prefetch fetch requests is nearly proportional to the number of L1I misses. For instruction fetches from memory regions of cacheable memory types, the following formula holds:

ICACHE.MISSES <= L2_RQSTS.CODE_RD_MISS + L2_RQSTS.CODE_RD_HIT
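Plugging in the measured numbers from the prefetchers-disabled run (taking L1-icache-load-misses as ICACHE.MISSES, i.e. r0280): even if L2_RQSTS.CODE_RD_HIT, which was not measured, is assumed to be zero, the inequality already holds:

```python
icache_misses = 2_613_075_664   # L1-icache-load-misses (ICACHE.MISSES)
code_rd_miss  = 5_065_167_059   # r2424 (L2_RQSTS.CODE_RD_MISS)
code_rd_hit   = 0               # not measured; zero is the worst case here

assert icache_misses <= code_rd_miss + code_rd_hit
print("inequality holds")
```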

I'm not sure whether this formula also holds for uncacheable fetch requests. I didn't test it in that condition. I know these requests are counted as ICACHE.MISSES, but not sure about the other events.

In your case, most instruction fetches will miss in both the L1I and the L2. You have 10,000 functions, each of which almost fully spans two 64-byte cache lines (I also made a version of the benchmark with only two functions, discussed below), so the total code size is much larger than the 256 KiB L2 available on Haswell. The functions are called in a non-sequential and unpredictable order, so the L1I and L2 prefetchers won't help much. The only noteworthy exception is returns, all of which will be predicted correctly using the RSB mechanism.

Each of the 10,000 functions is called 100,000 times in the loop, and most fetch requests are for the lines occupied by these functions. The total number of useful instruction fetch requests is about 2 lines per function * 10,000 functions * 100,000 iterations = 2,000,000,000 lines, most of which will miss in the L1I and L2 (but probably hit in the L3 after the first cold iteration). Several million other requests will be for lines occupied by the loop body. Your measurements show about 30% more L1I instruction fetch misses (2.6 billion) than this useful-fetch estimate; the difference comes from branch mispredictions, which cause fetch requests for incorrect lines that may not even be in the L1I and/or L2. Each L1I miss may trigger a prefetch, so it's normal for L2 code fetches to be within twice the number of L1I misses. This is consistent with your numbers.
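The arithmetic above can be checked directly against the measured counters (a sketch; the 2-lines-per-function figure is this answer's estimate):

```python
lines_per_func = 2        # each function spans ~2 cache lines
n_funcs = 10_000
iterations = 100_000

useful_fetches = lines_per_func * n_funcs * iterations   # 2,000,000,000
l1i_misses = 2_613_075_664                               # measured L1-icache-load-misses
l2_code_misses = 5_065_167_059                           # measured r2424

print(l1i_misses / useful_fetches)   # ~1.31: ~30% extra from mispredicted fetches
print(l2_code_misses / l1i_misses)   # ~1.94: within the 2x bound from prefetching
```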

In my two-function version, I'm counting 24 instructions per invoked function, so I expect the total number of retired instructions to be approximately 24 billion, but you got 25 billion. Either I don't know how to count, or you have 25 instructions per function for some reason.
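The retired-instruction estimate works out as follows (24 instructions per call is the count from my two-function build; dividing the measured total by the number of calls suggests roughly 25 per call):

```python
instructions_per_call = 24        # counted in the two-function version
calls = 10_000 * 100_000          # 10,000 functions x 100,000 iterations

estimate = instructions_per_call * calls   # 24,000,000,000
measured = 25_108_610_204                  # perf 'instructions' count

print(measured / calls)   # ~25.1 instructions per call actually retired
```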
