为什么cmp指令花费太多时间? [英] Why does cmp instruction cost too much time?

查看:102
本文介绍了为什么cmp指令花费太多时间?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用 libunwind(使用 linux perf)进行配置文件,perf top 监视目标进程,我得到了这个组装时间成本屏幕:

I am trying to do profile with libunwind (using linux perf), with perf top monitoring the target process, I get this assembly time cost screen:

  0.19 │       mov    %rcx,0x18(%rsp)                                                                                                                                    ▒
       │     trace_lookup():                                                                                                                                             ▒
  1.54 │       mov    0x8(%r9),%rcx                                                                                                                                      ▒
       │     _ULx86_64_tdep_trace():                                                                                                                                     ▒
  0.52 │       and    $0x1,%edx                                                                                                                                          ◆
  0.57 │       mov    %r14d,0xc(%rsp)                                                                                                                                    ▒
  0.40 │       mov    0x78(%rsp),%r10                                                                                                                                    ▒
  1.24 │       sub    %rdx,%r15                                                                                                                                          ▒
       │     trace_lookup():                                                                                                                                             ▒
  0.35 │       shl    %cl,%r12d                                                                                                                                          ▒
       │     _ULx86_64_tdep_trace():                                                                                                                                     ▒
  2.18 │       mov    0x90(%rsp),%r8                                                                                                                                     ▒
       │     trace_lookup():                                                                                                                                             ▒
  0.46 │       imul   %r15,%r13                                                                                                                                          ▒
       │     _ULx86_64_tdep_trace():                                                                                                                                     ▒
  0.59 │       mov    %r15,0x88(%rsp)                                                                                                                                    ▒
       │     trace_lookup():                                                                                                                                             ▒
  0.50 │       lea    -0x1(%r12),%rdx                                                                                                                                    ▒
  1.22 │       shr    $0x2b,%r13                                                                                                                                         ▒
  0.37 │       and    %r13,%rdx                                                                                                                                          ▒
  0.57 │177:   mov    %rdx,%rbp                                                                                                                                          ▒
  0.43 │       shl    $0x4,%rbp                                                                                                                                          ▒
  1.33 │       add    %rdi,%rbp                                                                                                                                          ▒
  0.49 │       mov    0x0(%rbp),%rsi                                                                                                                                     ▒
 24.40 │       cmp    %rsi,%r15                                                                                                                                          ▒
       │     ↓ jne    420                                                                                                                                                ▒
       │     _ULx86_64_tdep_trace():                                                                                                                                     ▒
  2.10 │18e:   movzbl 0x8(%rbp),%edx                                                                                                                                     ▒
  3.68 │       test   $0x8,%dl                                                                                                                                           ▒
       │     ↓ jne    370                                                                                                                                                ▒
  1.27 │       mov    %edx,%eax                                                                                                                                          ▒
  0.06 │       shl    $0x5,%eax                                                                                                                                          ▒
  0.73 │       sar    $0x5,%al                                                                                                                                           ▒
  1.70 │       cmp    $0xfe,%al                                                                                                                                          ▒
       │     ↓ je     380                                                                                                                                                ▒
  0.01 │     ↓ jle    2f0                                                                                                                                                ▒
  0.01 │       cmp    $0xff,%al                                                                                                                                          ▒
       │     ↓ je     3a0                                                                                                                                                ▒
  0.02 │       cmp    $0x1,%al                                                                                                                                           ▒
       │     ↓ jne    298                                                                                                                                                ▒
  0.01 │       and    $0x10,%edx                                                                                                                                         ▒
       │       movl   $0x1,0x10(%rsp)                                                                                                                                    ▒
       │       movl   $0x1,0x1c8(%rbx)                                                                                                                                   ▒
  0.00 │     ↓ je     393                             

对应的源码在这里trace_lookup源码,如果我没看错的话,这条热路径cmp指令对应的代码行数是第296行,但不知道为什么这一行这么慢,花费了大部分时间?

The corresponding source code is here trace_lookup source code, If I read correctly, the number of lines of code corresponding to this hot path cmp instruction is line 296, but I don't know why this line is so slow and cost most of the time?

推荐答案

Command cmp %rsi,%r15 被标记为具有巨大开销,因为它等待数据从缓存或内存加载mov 0x0(%rbp),%rsi 命令.该命令可能存在 L1 甚至 L2 缓存未命中.

Command cmp %rsi,%r15 is marked as having huge overhead because it waits for data to be loaded from cache or memory by mov 0x0(%rbp),%rsi command. There is probably L1 or even L2 cache miss on that command.

对于代码片段

       │     trace_lookup():
  0.50 │       lea    -0x1(%r12),%rdx
  1.22 │       shr    $0x2b,%r13     
  0.37 │       and    %r13,%rdx      
  0.57 │177:   mov    %rdx,%rbp      
  0.43 │       shl    $0x4,%rbp      
  1.33 │       add    %rdi,%rbp      
  0.49 │       mov    0x0(%rbp),%rsi 
 24.40 │       cmp    %rsi,%r15      
       │     ↓ jne    420    

您有 24% 的当前函数分析事件占 cmp 指令.采样分析的默认事件是cycles"(CPU时钟周期的硬件事件)或cpu-clock"(线性时间的软件事件).因此,大约 24% 的采样中断确实中断了此功能,是针对此 cmp 命令的指令地址报告的.分析和现代乱序 CPU 可能存在系统性偏差,此时报告的成本不是针对运行缓慢的命令,而是针对未快速完成执行(退出)的命令.如果 %rsi 寄存器值不等于 %r15 寄存器值,则此 cmp+jne 命令对(融合 uop)将更改程序的指令流.在古代,这样的命令应该只是读取两个寄存器并比较它们的值,这很快并且不会占用函数执行时间的 1/4.但是现代 CPU 寄存器不仅仅是 32 位或 64 位存储值的地方,它们还有一些用于乱序引擎的隐藏标志(或重命名技术).在您的示例中,有 mov 0x0(%rbp),%rsi 确实更改了 %rsi 寄存器.该命令通过地址 *%rbp 从内存加载.CPU 确实开始将此加载到缓存/内存子系统中,并将 %rsi 寄存器标记为从内存加载挂起",继续执行指令.有可能下一条指令不需要该加载的结果(这需要一些时间,例如 Haswell:L1 命中需要 4 个 cpu 周期,L2 命中需要 12 个 CPU 周期,L3 命中需要 36-66 个 CPU 周期,缓存未命中和 RAM 读取需要额外 50-100 ns).但是在您的情况下,下一条指令是 cmp+jne 并从 %rsi 读取,并且在将内存中的数据写入 %rsi 之前,该指令无法完成(CPU 可能会在 cmp+jne 执行过程中阻塞或多次重新启动那个命令).因此, cmp 有 24% 的开销,因为该 mov 确实错过了最近的缓存.使用更高级的计数器,您可以估计它错过了哪个缓存,以及哪个缓存/内存层最常为请求提供服务.

you have 24% of profiling events of the current function accounted to the cmp instruction. Default event for sampling profiling is "cycles" (hardware event for CPU clock cycles) or "cpu-clock" (software event for linear time). So, around 24% of sampling interrupts which did interrupt this function were reported for instruction address of this cmp command. There is systematic skew possible with profiling and modern Out-of-order CPUs, when cost is reported not for command which did run slowly, but for the command which did not finish its execution (retire) quickly. This cmp+jne command pair (fused uop) will change instruction flow of program if %rsi register value is not equal to %r15 register value. For ancient times such command should just read two registers and compare their values, which is fast and should not take 1/4 of function execution time. But with modern CPU registers are not just 32 or 64 bit place to store the value, they have some hidden flags (or renaming techniques) used in Out-of-order engines. In your example, there was mov 0x0(%rbp),%rsi which did change %rsi register. This command is load from memory by address *%rbp. CPU did start this load into cache/memory subsystem and mark %rsi register as "load pending from memory", continuing to execute instructions. There are some chances that next instructions will not require result of that load (which takes some time, for example Haswell: 4 cpu cycles for L1 hit, 12 for L2 hit, 36-66 for L3 hit, and additional 50-100 ns for cache miss and RAM read). But in your case next instruction was cmp+jne with read from %rsi, and that instruction can't be finished until data from memory is written to %rsi (CPU may block in the middle of cmp+jne execution or do many restarts of that command). So, cmp has 24% overhead because that mov did miss nearest caches. With more advanced counters you can estimate which cache it did miss, and which cache/memory layer did serve the request most often.

这里对应的源码是trace_lookup源码,如果我没看错的话,这条hot path cmp指令对应的代码行数是296行,但不知道为什么这行这么慢,成本最高时间?

The corresponding source code is here trace_lookup source code, If I read correctly, the number of lines of code corresponding to this hot path cmp instruction is line 296, but I don't know why this line is so slow and cost most of the time?

对于如此短的 asm 片段,很难在 trace_lookup 的源代码中找到相应的代码行,也很难找到 L1/L2 缓存中没有的值和原因.您应该尝试编写简短的可重现示例.

With so short asm fragment it can be hard to find corresponding code line in source code of trace_lookup and to find what value and why was not in L1/L2 cache. You should try to write shorted reproducible example.

这篇关于为什么cmp指令花费太多时间?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆