Interpretation of perf stat output

Question

I have developed a code that gets as input a large 2-D image (up to 64MPixels) and:

  • Applies a filter on every row
  • Transposes the image (blocking is used to avoid lots of cache misses)
  • Applies a filter on the columns (now rows) of the image
  • Transposes the filtered image back for further calculations

Although it doesn't change anything, for the sake of completeness of my question: the filtering applies a discrete wavelet transform and the code is written in C.
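For illustration only, a per-row filtering step could look like the sketch below. It applies a single Haar-style DWT level; the actual wavelet, boundary handling, and data type in the question's code are not stated, so every name here is hypothetical.

    #include <stddef.h>

    /* Illustration only: one Haar-style DWT level applied to each row.
     * Approximation coefficients go to the first half of the output row,
     * detail coefficients to the second half. cols is assumed even. */
    void dwt_rows(const float *restrict in, float *restrict out,
                  int rows, int cols)
    {
        for (int r = 0; r < rows; r++) {
            const float *src = in  + (size_t)r * cols;
            float       *dst = out + (size_t)r * cols;
            for (int c = 0; c < cols / 2; c++) {
                float a = src[2 * c];
                float b = src[2 * c + 1];
                dst[c]            = (a + b) * 0.5f;  /* approximation */
                dst[cols / 2 + c] = (a - b) * 0.5f;  /* detail */
            }
        }
    }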

My end goal is to make this run as fast as possible. The speedups I have so far are more than 10x, through the use of the blocked matrix transpose, vectorization, multithreading, compiler-friendly code, etc.
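For reference, a cache-blocked transpose along these lines could be sketched as follows. This is a minimal illustration, not the question's actual code; the float element type and the 64-element tile size are assumptions.

    #include <stddef.h>

    #define BLOCK 64  /* tile size chosen so a BLOCK x BLOCK tile stays cache-resident */

    void transpose_blocked(const float *restrict src, float *restrict dst,
                           int rows, int cols)
    {
        for (int ib = 0; ib < rows; ib += BLOCK) {
            for (int jb = 0; jb < cols; jb += BLOCK) {
                int imax = (ib + BLOCK < rows) ? ib + BLOCK : rows;
                int jmax = (jb + BLOCK < cols) ? jb + BLOCK : cols;
                /* Walk one tile at a time so both the reads from src and the
                 * strided writes to dst touch lines that stay in cache. */
                for (int i = ib; i < imax; i++)
                    for (int j = jb; j < jmax; j++)
                        dst[(size_t)j * rows + i] = src[(size_t)i * cols + j];
            }
        }
    }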

Coming to my question: The latest profiling stats of the code I have (using perf stat -e) have troubled me.

        76,321,873 cache-references                                            
     8,647,026,694 cycles                    #    0.000 GHz                    
     7,050,257,995 instructions              #    0.82  insns per cycle        
        49,739,417 cache-misses              #   65.171 % of all cache refs    

       0.910437338 seconds time elapsed

The ratio (# of cache-misses)/(# of instructions) is low, at around 0.7% (49,739,417 misses against 7,050,257,995 instructions). Here it is mentioned that this number is a good thing to keep in mind when checking for memory efficiency.

On the other hand, the ratio of cache-misses to cache-references is significantly high (65%!), which as I see it could indicate that something is going wrong with the execution in terms of cache efficiency.

The detailed stat from perf stat -d is:

   2711.191150 task-clock                #    2.978 CPUs utilized          
         1,421 context-switches          #    0.524 K/sec                  
            50 cpu-migrations            #    0.018 K/sec                  
       362,533 page-faults               #    0.134 M/sec                  
 8,518,897,738 cycles                    #    3.142 GHz                     [40.13%]
 6,089,067,266 stalled-cycles-frontend   #   71.48% frontend cycles idle    [39.76%]
 4,419,565,197 stalled-cycles-backend    #   51.88% backend  cycles idle    [39.37%]
 7,095,514,317 instructions              #    0.83  insns per cycle        
                                         #    0.86  stalled cycles per insn [49.66%]
   858,812,708 branches                  #  316.766 M/sec                   [49.77%]
     3,282,725 branch-misses             #    0.38% of all branches         [50.19%]
 1,899,797,603 L1-dcache-loads           #  700.724 M/sec                   [50.66%]
   153,927,756 L1-dcache-load-misses     #    8.10% of all L1-dcache hits   [50.94%]
    45,287,408 LLC-loads                 #   16.704 M/sec                   [40.70%]
    26,011,069 LLC-load-misses           #   57.44% of all LL-cache hits    [40.45%]

   0.910380914 seconds time elapsed

Here frontend and backend stalled cycles are also high and the lower level caches seem to suffer from a high miss rate of 57.5%.

Which metric is the most appropriate for this scenario? One idea I was thinking is that it could be the case that the workload no longer requires further "touching" of the LL caches after the initial image load (loads the values once and after that it's done - the workload is more CPU-bound than memory-bound being an image filtering algorithm).

The machine I'm running this on is a Xeon E5-2680 (20M of Smart cache, out of which 256KB L2 cache per core, 8 cores).

Answer

The first thing you want to make sure is that no other compute intensive process is running on your machine. That's a server CPU so I thought that could be a problem.

If you use multi-threading in your program and you distribute an equal amount of work between threads, you might be interested in collecting metrics only on one CPU.
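One way to make such a per-CPU measurement meaningful is to pin each worker thread to a fixed core; a minimal Linux-specific sketch is below (the core number you pass in, and the helper name, are up to you).

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling thread to a single core so that counting events on
     * that core (e.g. with perf stat -C <core>) observes only this thread. */
    int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
    }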

I suggest disabling hyper-threading in the optimization phase as it can lead to confusion when interpreting the profiling metrics. (e.g. increased #cycles spent in the back-end). Also if you distribute work to 3 threads, you have a high chance that 2 threads will share the resources of one core and the 3rd will have the entire core for itself - and it will be faster.

Perf has never been very good at explaining the metrics. Judging by the order of magnitude, the cache references are the L2 misses that hit the LLC. A high LLC miss number compared with LLC references is not always a bad thing if the number of LLC references / #Instructions is low. In your case, you have 0.018 so that means that most of your data is being used from L2. The high LLC miss ratio means that you still need to get data from RAM and write it back.

Regarding the #Cycles BE and FE bound, I'm a bit concerned about the values because they neither sum to 100% nor to the total number of cycles. You have 8G cycles in total, yet 6G cycles stalled in the FE and 4G cycles stalled in the BE. That does not seem right.

If the FE stall cycles are high, it means you have misses in the instruction cache or bad branch speculation. If the BE stall cycles are high, it means you are waiting for data.

Anyway, regarding your question: the most relevant metric to assess the performance of your code is Instructions / Cycle (IPC). Your CPU can execute up to 4 instructions per cycle, but you only achieve 0.8. That means resources are underutilized (unless you have many vector instructions). After IPC, you should check branch misses and L1 misses (data and code), because those generate the most penalties.
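If you want to confirm the IPC of just the filtering region rather than the whole program, one option is to read the hardware counters directly around that region. The sketch below is a minimal Linux-only illustration using perf_event_open; the dummy loop merely stands in for the real filter/transpose calls, and the counters are opened as two independent events, so the ratio is approximate.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Open one hardware counter for the calling process on any CPU. */
    static int open_counter(uint64_t config)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        /* pid = 0 (this process), cpu = -1 (any CPU), group_fd = -1, flags = 0 */
        return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int fd_cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);
        int fd_insns  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
        if (fd_cycles < 0 || fd_insns < 0) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd_cycles, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_insns,  PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_cycles, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(fd_insns,  PERF_EVENT_IOC_ENABLE, 0);

        /* Region of interest: replace with the filter/transpose calls. */
        volatile double acc = 0.0;
        for (long i = 0; i < 10 * 1000 * 1000; i++)
            acc += (double)i * 1e-9;

        ioctl(fd_cycles, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(fd_insns,  PERF_EVENT_IOC_DISABLE, 0);

        uint64_t cycles = 0, insns = 0;
        read(fd_cycles, &cycles, sizeof(cycles));
        read(fd_insns,  &insns,  sizeof(insns));

        printf("cycles: %llu  instructions: %llu  IPC: %.2f\n",
               (unsigned long long)cycles, (unsigned long long)insns,
               cycles ? (double)insns / (double)cycles : 0.0);

        close(fd_cycles);
        close(fd_insns);
        return 0;
    }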

A final suggestion: you may be interested in trying Intel's VTune Amplifier. It gives a much better explanation of the metrics and points you to the eventual problems in your code.
