What are stalled-cycles-frontend and stalled-cycles-backend in 'perf stat' result?

Question

Does anybody know the meaning of stalled-cycles-frontend and stalled-cycles-backend in the perf stat result? I searched the internet but did not find an answer. Thanks

$ sudo perf stat ls                     

Performance counter stats for 'ls':

      0.602144 task-clock                #    0.762 CPUs utilized          
             0 context-switches          #    0.000 K/sec                  
             0 CPU-migrations            #    0.000 K/sec                  
           236 page-faults               #    0.392 M/sec                  
        768956 cycles                    #    1.277 GHz                    
        962999 stalled-cycles-frontend   #  125.23% frontend cycles idle   
        634360 stalled-cycles-backend    #   82.50% backend  cycles idle
        890060 instructions              #    1.16  insns per cycle        
                                         #    1.08  stalled cycles per insn
        179378 branches                  #  297.899 M/sec                  
          9362 branch-misses             #    5.22% of all branches         [48.33%]

   0.000790562 seconds time elapsed

Recommended answer

The theory:

Let's start from this: today's CPUs are superscalar, which means that they can execute more than one instruction per cycle (IPC). The latest Intel architectures can reach an IPC of up to 4 (4 x86 instruction decoders). Let's leave macro- and micro-fusion out of the discussion so as not to complicate things further :).

Typically, workloads do not reach IPC=4 because of contention for various resources. This means that the CPU is wasting cycles: the number of instructions is fixed by the software, and the CPU has to execute them in as few cycles as possible.

We can divide the total cycles spent by the CPU into 3 categories:

  1. Cycles where instructions get retired (useful work)
  2. Cycles spent in the Back-End (wasted)
  3. Cycles spent in the Front-End (wasted).

To get an IPC of 4, the number of cycles in which instructions retire has to be close to the total number of cycles. Keep in mind that in this stage all the micro-operations (uOps) retire from the pipeline and commit their results to registers / caches. At this stage you can even have more than 4 uOps retiring, because this number is given by the number of execution ports. If only 25% of the cycles retire 4 uOps, the overall IPC will be 1.
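The arithmetic in the last sentence can be checked with a small sketch. It assumes a deliberately simplified model in which every cycle either retires a fixed number of operations or retires nothing, and it counts one uOp as one instruction (fusion is ignored, as above). The function name `overall_ipc` is made up for illustration.

```python
def overall_ipc(retiring_fraction: float, retired_per_cycle: int = 4) -> float:
    """Average IPC if only `retiring_fraction` of the cycles retire anything.

    Simplifying assumption: every retiring cycle retires exactly
    `retired_per_cycle` operations, and all other cycles retire none.
    """
    if not 0.0 <= retiring_fraction <= 1.0:
        raise ValueError("retiring_fraction must be between 0 and 1")
    return retiring_fraction * retired_per_cycle


if __name__ == "__main__":
    for fraction in (1.0, 0.5, 0.25):
        print(f"{fraction:.0%} of cycles retiring -> IPC {overall_ipc(fraction):.2f}")
```

With a retiring fraction of 0.25 the model gives an IPC of 1, matching the statement above.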

The cycles stalled in the back-end are wasted because the CPU has to wait for resources (usually memory) or for long-latency instructions to finish (e.g. transcendentals, square roots, reciprocals, divisions).

The cycles stalled in the front-end are wasted because the front-end is not feeding the back-end with micro-operations. This can mean that you have misses in the instruction cache, or complex instructions that are not already decoded in the micro-op cache. Just-in-time compiled code usually exhibits this behavior.

Another reason for stalls is a branch prediction miss. This is called bad speculation. In that case uOps are issued, but they are discarded because the branch predictor (BP) predicted wrong.

The implementation in profilers:

How do you interpret the back-end (BE) and front-end (FE) stalled cycles?

Different profilers take different approaches to these metrics. In vTune, categories 1 to 3 add up to 100% of the cycles. That seems reasonable, because either your CPU is stalled (no uOps are retiring) or it is performing useful work (uOps are retiring). See more here: https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/snb/index.htm
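As a rough sketch of what "adding up to 100%" implies, the function below splits a cycle count into the three categories. It assumes every cycle belongs to exactly one category; the name `cycle_breakdown` and the sample counts are hypothetical and are not taken from vTune or from any real measurement.

```python
def cycle_breakdown(total_cycles: int, frontend_stalled: int, backend_stalled: int) -> dict:
    """Split cycles into three shares (in percent) that sum to 100.

    Assumes each cycle falls into exactly one category, so the two
    stall counts together may not exceed the total.
    """
    retiring = total_cycles - frontend_stalled - backend_stalled
    if total_cycles <= 0 or min(frontend_stalled, backend_stalled) < 0 or retiring < 0:
        raise ValueError("counts do not partition the total cycles")
    return {
        "retiring": 100.0 * retiring / total_cycles,
        "frontend": 100.0 * frontend_stalled / total_cycles,
        "backend": 100.0 * backend_stalled / total_cycles,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the model.
    shares = cycle_breakdown(total_cycles=1000, frontend_stalled=300, backend_stalled=450)
    print(shares)
```

Counts like those in the question, where the front-end stall count alone is larger than the cycle count, do not fit this model, and the function rejects them with a `ValueError`.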

In perf this usually does not happen. That's a problem, because when you see 125% of cycles stalled in the front end, you don't really know how to interpret it. You could link a metric greater than 1 to the fact that there are 4 decoders, but if you follow that reasoning through, the IPC won't match.

Even better, you don't know how big the problem is. 125% out of what? What do the #cycles mean then?
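One way to see where these percentages come from is to recompute the derived columns from the raw counts in the question. The sketch below assumes that each stalled-cycles counter is simply divided by the `cycles` counter, and that "stalled cycles per insn" is the larger of the two stall counts divided by `instructions`. These formulas reproduce the printed numbers, but they are inferred from the output and are not taken from the perf source.

```python
# Raw counter values copied from the `perf stat ls` output in the question.
counts = {
    "cycles": 768956,
    "stalled-cycles-frontend": 962999,
    "stalled-cycles-backend": 634360,
    "instructions": 890060,
}


def derived_metrics(c: dict) -> dict:
    """Recompute perf's derived columns (assumed formulas, see the text above)."""
    worst_stall = max(c["stalled-cycles-frontend"], c["stalled-cycles-backend"])
    return {
        "frontend_idle_pct": 100.0 * c["stalled-cycles-frontend"] / c["cycles"],
        "backend_idle_pct": 100.0 * c["stalled-cycles-backend"] / c["cycles"],
        "insns_per_cycle": c["instructions"] / c["cycles"],
        "stalled_cycles_per_insn": worst_stall / c["instructions"],
    }


if __name__ == "__main__":
    for name, value in derived_metrics(counts).items():
        print(f"{name:>24}: {value:.2f}")
```

Under these assumptions the percentage is a plain ratio of two independent counters, which is why it can exceed 100% and why it is hard to read as a share of the cycles.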

I am personally a bit suspicious of perf's BE and FE stalled cycles and hope this will get fixed.

We will probably get the final answer by debugging the code here: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/perf/builtin-stat.c
