Perf startup overhead: Why does a simple static executable which performs MOV + SYS_exit have so many stalled cycles (and instructions)?
Question
I'm trying to understand how to measure performance and decided to write a very simple program:
section .text
global _start
_start:
    mov rax, 60    ; 60 = __NR_exit on x86-64
    syscall
And I ran the program with perf stat ./bin. The thing that surprised me is that stalled-cycles-frontend was so high.
0.038132 task-clock (msec) # 0.148 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
2 page-faults # 0.052 M/sec
107,386 cycles # 2.816 GHz
81,229 stalled-cycles-frontend # 75.64% frontend cycles idle
47,654 instructions # 0.44 insn per cycle
# 1.70 stalled cycles per insn
8,601 branches # 225.559 M/sec
929 branch-misses # 10.80% of all branches
0.000256994 seconds time elapsed
As I understand stalled-cycles-frontend, it means that the CPU frontend has to wait for the result of some operation (e.g. a bus transaction) to complete.
So what caused the CPU frontend to wait most of the time in this simplest case?
And 2 page faults? Why? I read no memory pages.
Answer
Page faults include code pages.

perf stat includes startup overhead.
IDK the details of how perf starts counting, but presumably it has to program the performance counters in kernel mode, so they're counting while the CPU switches back to user mode (stalling for many cycles, especially on a kernel with Meltdown defenses, which invalidate the TLBs).
I guess most of the 47,654 instructions that were recorded were kernel code. Perhaps including the page-fault handler!
I guess your process never goes user->kernel->user; the whole process is kernel->user->kernel (startup, then a syscall to invoke sys_exit, never returning to user-space), so there's never a case where the TLBs would have been hot anyway, except maybe when running inside the kernel after the sys_exit system call. And anyway, TLB misses aren't page faults, but this would explain lots of stalled cycles.
The user->kernel transition itself explains about 150 stalled cycles, BTW. syscall is faster than a cache miss (except that it's not pipelined, and in fact flushes the whole pipeline; i.e. the privilege level is not renamed).