Perf startup overhead: Why does a simple static executable which performs MOV + SYS_exit have so many stalled cycles (and instructions)?

Question

I'm trying to understand how to measure performance, so I decided to write a very simple program:

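section .text
    global _start

_start:
    mov rax, 60     ; system call number 60 = SYS_exit on x86-64 Linux
    syscall         ; exit; the status comes from rdi, which this program never sets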

I then ran the program with perf stat ./bin. What surprised me was how high stalled-cycles-frontend was.

      0.038132      task-clock (msec)         #    0.148 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
             2      page-faults               #    0.052 M/sec                  
       107,386      cycles                    #    2.816 GHz                    
        81,229      stalled-cycles-frontend   #   75.64% frontend cycles idle   
        47,654      instructions              #    0.44  insn per cycle         
                                              #    1.70  stalled cycles per insn
         8,601      branches                  #  225.559 M/sec                  
           929      branch-misses             #   10.80% of all branches        

   0.000256994 seconds time elapsed

As I understand it, stalled-cycles-frontend means that the CPU front end has to wait for the result of some operation (e.g. a bus transaction) to complete.

So what caused the CPU front end to wait most of the time in this simplest possible case?

And 2 page faults? Why? I read no memory pages.

Recommended answer

Page faults include code pages.

perf stat includes startup overhead.

IDK the details of how perf starts counting, but presumably it has to program the performance counters in kernel mode, so they're already counting while the CPU switches back to user mode (stalling for many cycles, especially on a kernel with Meltdown defenses, which invalidate the TLBs).

I guess most of the 47,654 instructions that were recorded were kernel code. Perhaps including the page-fault handler!

I guess your process never goes user->kernel->user, the whole process is kernel->user->kernel (startup, syscall to invoke sys_exit, then never returns to user-space), so there's never a case where the TLBs would have been hot anyway, except maybe when running inside the kernel after the sys_exit system call. And anyway, TLB misses aren't page faults, but this would explain lots of stalled cycles.

The user->kernel transition itself explains about 150 stalled cycles, BTW. syscall is faster than a cache miss (except it's not pipelined, and in fact flushes the whole pipeline; i.e. the privilege level is not renamed.)
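
As a rough sanity check on those numbers (this is not part of the original answer), here is a minimal Python sketch that recomputes the derived column of the perf stat report from the raw counts in the question, and then compares the answer's ballpark of about 150 stalled cycles for the user->kernel transition against the 81,229 stalled front-end cycles. The function name derived_metrics and the dictionary keys are made up for illustration, and the 150-cycle figure is only the estimate quoted above, not a measurement.

```python
# Raw counts copied from the perf stat output in the question.
task_clock_ms = 0.038132
elapsed_s = 0.000256994
cycles = 107_386
stalled_frontend = 81_229
instructions = 47_654
branches = 8_601
branch_misses = 929

# Ballpark from the answer, not a measured value: stalled cycles
# attributed to a single user->kernel transition.
syscall_transition_stall = 150


def derived_metrics():
    """Recompute the right-hand column of the perf stat report."""
    task_clock_s = task_clock_ms / 1000
    return {
        "cpus_utilized": task_clock_s / elapsed_s,
        "ghz": cycles / task_clock_s / 1e9,
        "frontend_idle_pct": 100 * stalled_frontend / cycles,
        "insn_per_cycle": instructions / cycles,
        "stalled_per_insn": stalled_frontend / instructions,
        "branch_miss_pct": 100 * branch_misses / branches,
        # Share of all front-end stalls that one transition could explain.
        "syscall_share_of_stalls_pct":
            100 * syscall_transition_stall / stalled_frontend,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name:>28}: {value:.3f}")
```

The recomputed ratios agree with the report (75.64% front-end cycles idle, 0.44 instructions per cycle, 1.70 stalled cycles per instruction). Under the 150-cycle assumption, the single transition accounts for well under 1% of the stalled cycles, which is consistent with the point that most of what perf stat counted here is startup work rather than the two user-space instructions.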
