Interpretation of perf stat output

Question

I have developed a code that gets as input a large 2-D image (up to 64MPixels) and:

  • Applies a filter on every row
  • Transposes the image (blocking is used to avoid lots of cache misses)
  • Applies a filter on the columns (now rows) of the image
  • Transposes the filtered image back for further calculations

Although it doesn't change anything, for the sake of completeness of my question: the filtering applies a discrete wavelet transform and the code is written in C.
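For illustration only, a per-row filtering step could look like the sketch below. It applies a single Haar-style DWT level; the actual wavelet, boundary handling, and data type in the question's code are not stated, so every name here is hypothetical.

    #include <stddef.h>

    /* Illustration only: one Haar-style DWT level applied to each row.
     * Approximation coefficients go to the first half of the output row,
     * detail coefficients to the second half. cols is assumed even. */
    void dwt_rows(const float *restrict in, float *restrict out,
                  int rows, int cols)
    {
        for (int r = 0; r < rows; r++) {
            const float *src = in  + (size_t)r * cols;
            float       *dst = out + (size_t)r * cols;
            for (int c = 0; c < cols / 2; c++) {
                float a = src[2 * c];
                float b = src[2 * c + 1];
                dst[c]            = (a + b) * 0.5f;  /* approximation */
                dst[cols / 2 + c] = (a - b) * 0.5f;  /* detail */
            }
        }
    }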

My end goal is to make this run as fast as possible. The speedups I have so far are more than 10x, through the use of the blocked matrix transpose, vectorization, multithreading, compiler-friendly code, etc.
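For reference, a cache-blocked transpose along these lines could be sketched as follows. This is a minimal illustration, not the question's actual code; the float element type and the 64-element tile size are assumptions.

    #include <stddef.h>

    #define BLOCK 64  /* tile size chosen so a BLOCK x BLOCK tile stays cache-resident */

    void transpose_blocked(const float *restrict src, float *restrict dst,
                           int rows, int cols)
    {
        for (int ib = 0; ib < rows; ib += BLOCK) {
            for (int jb = 0; jb < cols; jb += BLOCK) {
                int imax = (ib + BLOCK < rows) ? ib + BLOCK : rows;
                int jmax = (jb + BLOCK < cols) ? jb + BLOCK : cols;
                /* Walk one tile at a time so both the reads from src and the
                 * strided writes to dst touch lines that stay in cache. */
                for (int i = ib; i < imax; i++)
                    for (int j = jb; j < jmax; j++)
                        dst[(size_t)j * rows + i] = src[(size_t)i * cols + j];
            }
        }
    }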

Coming to my question: The latest profiling stats of the code I have (using perf stat -e) have troubled me.

        76,321,873 cache-references                                            
     8,647,026,694 cycles                    #    0.000 GHz                    
     7,050,257,995 instructions              #    0.82  insns per cycle        
        49,739,417 cache-misses              #   65.171 % of all cache refs    

       0.910437338 seconds time elapsed

The ratio (# of cache-misses)/(# of instructions) is low, at around 0.7% (49,739,417 misses against 7,050,257,995 instructions). Here it is mentioned that this number is a good thing to keep in mind when checking for memory efficiency.

On the other hand, the ratio of cache-misses to cache-references is significantly high (65%!), which as I see it could indicate that something is going wrong with the execution in terms of cache efficiency.

The detailed stat from perf stat -d is:

   2711.191150 task-clock                #    2.978 CPUs utilized          
         1,421 context-switches          #    0.524 K/sec                  
            50 cpu-migrations            #    0.018 K/sec                  
       362,533 page-faults               #    0.134 M/sec                  
 8,518,897,738 cycles                    #    3.142 GHz                     [40.13%]
 6,089,067,266 stalled-cycles-frontend   #   71.48% frontend cycles idle    [39.76%]
 4,419,565,197 stalled-cycles-backend    #   51.88% backend  cycles idle    [39.37%]
 7,095,514,317 instructions              #    0.83  insns per cycle        
                                         #    0.86  stalled cycles per insn [49.66%]
   858,812,708 branches                  #  316.766 M/sec                   [49.77%]
     3,282,725 branch-misses             #    0.38% of all branches         [50.19%]
 1,899,797,603 L1-dcache-loads           #  700.724 M/sec                   [50.66%]
   153,927,756 L1-dcache-load-misses     #    8.10% of all L1-dcache hits   [50.94%]
    45,287,408 LLC-loads                 #   16.704 M/sec                   [40.70%]
    26,011,069 LLC-load-misses           #   57.44% of all LL-cache hits    [40.45%]

   0.910380914 seconds time elapsed

Here frontend and backend stalled cycles are also high and the lower level caches seem to suffer from a high miss rate of 57.5%.

Which metric is the most appropriate for this scenario? One idea I was thinking is that it could be the case that the workload no longer requires further "touching" of the LL caches after the initial image load (loads the values once and after that it's done - the workload is more CPU-bound than memory-bound being an image filtering algorithm).

The machine I'm running this on is a Xeon E5-2680 (20M of Smart cache, out of which 256KB L2 cache per core, 8 cores).

Answer

The first thing you want to make sure is that no other compute intensive process is running on your machine. That's a server CPU so I thought that could be a problem.

If you use multi-threading in your program and you distribute an equal amount of work between threads, you might be interested in collecting metrics only on one CPU.
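One way to make such a per-CPU measurement meaningful is to pin each worker thread to a fixed core; a minimal Linux-specific sketch is below (the core number you pass in, and the helper name, are up to you).

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling thread to a single core so that counting events on
     * that core (e.g. with perf stat -C <core>) observes only this thread. */
    int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
    }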

I suggest disabling hyper-threading in the optimization phase as it can lead to confusion when interpreting the profiling metrics. (e.g. increased #cycles spent in the back-end). Also if you distribute work to 3 threads, you have a high chance that 2 threads will share the resources of one core and the 3rd will have the entire core for itself - and it will be faster.

Perf has never been very good at explaining the metrics. Judging by the order of magnitude, the cache references are the L2 misses that hit the LLC. A high LLC miss number compared with LLC references is not always a bad thing if the number of LLC references / #Instructions is low. In your case, you have 0.018 so that means that most of your data is being used from L2. The high LLC miss ratio means that you still need to get data from RAM and write it back.

Regarding the #Cycles BE and FE bound, I'm a bit concerned about the values because they neither sum to 100% nor to the total number of cycles. You have 8G cycles in total, yet 6G cycles stalled in the FE and 4G cycles stalled in the BE. That does not seem right.

If the FE stall cycles are high, it means you have misses in the instruction cache or bad branch speculation. If the BE stall cycles are high, it means you are waiting for data.

Anyway, regarding your question: the most relevant metric to assess the performance of your code is Instructions / Cycle (IPC). Your CPU can execute up to 4 instructions per cycle, but you only achieve 0.8. That means resources are underutilized (unless you have many vector instructions). After IPC, you should check branch misses and L1 misses (data and code), because those generate the most penalties.
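If you want to confirm the IPC of just the filtering region rather than the whole program, one option is to read the hardware counters directly around that region. The sketch below is a minimal Linux-only illustration using perf_event_open; the dummy loop merely stands in for the real filter/transpose calls, and the counters are opened as two independent events, so the ratio is approximate.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Open one hardware counter for the calling process on any CPU. */
    static int open_counter(uint64_t config)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        /* pid = 0 (this process), cpu = -1 (any CPU), group_fd = -1, flags = 0 */
        return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int fd_cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);
        int fd_insns  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
        if (fd_cycles < 0 || fd_insns < 0) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd_cycles, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_insns,  PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_cycles, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(fd_insns,  PERF_EVENT_IOC_ENABLE, 0);

        /* Region of interest: replace with the filter/transpose calls. */
        volatile double acc = 0.0;
        for (long i = 0; i < 10 * 1000 * 1000; i++)
            acc += (double)i * 1e-9;

        ioctl(fd_cycles, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(fd_insns,  PERF_EVENT_IOC_DISABLE, 0);

        uint64_t cycles = 0, insns = 0;
        read(fd_cycles, &cycles, sizeof(cycles));
        read(fd_insns,  &insns,  sizeof(insns));

        printf("cycles: %llu  instructions: %llu  IPC: %.2f\n",
               (unsigned long long)cycles, (unsigned long long)insns,
               cycles ? (double)insns / (double)cycles : 0.0);

        close(fd_cycles);
        close(fd_insns);
        return 0;
    }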

A final suggestion: you may be interested in trying Intel's VTune Amplifier. It gives a much better explanation of the metrics and points you to the eventual problems in your code.
