How do I determine the number of x86 machine instructions executed in a C program?


Question


I'm currently working on a homework problem that asks me to find out the number of machine code instructions that are executed when running a short program I wrote in C.

The question says I am able to use whatever tools I want to figure it out, but I'm fairly new to C and have very little idea how to go about this.

What types of tools do I need to figure this out?

Answer

Terminology: what you're asking for is the dynamic instruction count, i.e. counting an instruction inside a loop every time it executes. This is usually roughly correlated with performance, but instructions-per-cycle can vary wildly.

People also look at static instruction count (or more usually just code size, because that's what really matters for instruction-cache footprint and disk-load times). That's a totally separate metric from the dynamic count. For variable-length instruction sets like x86, static instruction count and code size are correlated but not the same thing. On a RISC with fixed-length instructions, like MIPS or AArch64, they're closer, but you still have padding to align the start of functions, for example. gcc -Os optimizes for code size while trying not to sacrifice too much speed.
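To make the static/dynamic distinction concrete, here is a toy model in C. It is not a measurement and not something from the original answer: the function names and the idea of splitting a loop into "setup" and "body" instructions are assumptions made purely for illustration.

```c
/* Toy model, not a measurement: a loop with `setup` instructions before it
   and `body` instructions per iteration (both numbers are hypothetical). */

/* Static count: every instruction is counted once, as it appears in the binary. */
unsigned long long static_instruction_count(unsigned long long setup,
                                            unsigned long long body)
{
    return setup + body;
}

/* Dynamic count: the loop body is counted again on every iteration. */
unsigned long long dynamic_instruction_count(unsigned long long setup,
                                             unsigned long long body,
                                             unsigned long long iterations)
{
    return setup + body * iterations;
}
```

With 3 setup instructions and a 4-instruction loop body that runs 1000 times, the static count is 7 while the dynamic count is 4003. Real programs also execute startup and library code, so a measured count will be higher than any such estimate.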


If you're on Linux, use gcc -O2 foo.c to compile your code. Before GCC 12, -O2 didn't enable auto-vectorization for gcc (it always has for clang); GCC 12 and later do enable it at -O2, with a very conservative cost model. -O2 is probably a good baseline level of optimization: it gets rid of work in your C code that doesn't actually need to happen, which avoids silly differences between using more or fewer tmp variables to break up a big expression. Maybe use -Og if you want minimal optimization, or -O0 if you want really dumb, braindead code that compiles each statement separately and never keeps anything in registers between statements. (See the Q&A "Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?")

Yes, it matters a huge amount how you compile. gcc -O3 -march=native -ffast-math might use a lot fewer instructions, if it auto-vectorizes a loop.

To stop your code from being optimized away, take an input from a command-line arg, or read it from a volatile variable, like volatile int size_volatile = 1234; int size = size_volatile;. Also return or print a result, because if the program has no side effects then the most efficient implementation is to just exit immediately.
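A minimal sketch of that pattern in C follows. The names sum_below and run_workload are hypothetical, and the summation loop is only a stand-in for whatever your homework program computes.

```c
#include <stdio.h>

/* The compiler must load this at run time, so it cannot treat the loop
   bound as a compile-time constant. */
static volatile int size_volatile = 1234;

/* Hypothetical workload: sums the integers 0 .. size-1. */
unsigned long sum_below(int size)
{
    unsigned long total = 0;
    for (int i = 0; i < size; i++)
        total += (unsigned long)i;
    return total;
}

/* Reads the input through the volatile variable, does the work, and prints
   the result so the work has an observable side effect. */
unsigned long run_workload(void)
{
    int size = size_volatile;      /* one real load of an opaque value */
    unsigned long total = sum_below(size);
    printf("%lu\n", total);
    return total;
}
```

Call run_workload() from main. Note that hiding the input only prevents constant folding; an optimizing compiler may still replace a simple loop like this one with a closed-form formula, so check the generated asm if the exact count matters.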


Then run perf stat ./a.out. That will use hardware performance counters to give you total instructions executed on behalf of your process. (Along with other counters, like CPU core clock cycles, and some software counters like page-faults and time in microseconds.)

To count only user-space instructions, use perf stat -e instructions:u ./a.out. (Or in recent perf versions, perf stat --all-user ./a.out to apply :u to all events, even the default set.) Each hardware event counter has 2 bits that indicate whether it should be counting events in user, supervisor, or both, so the kernel's perf code doesn't have to run an instruction to stop counters for :u events or anything like that.
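If you want to count instructions for just one region of a program instead of the whole process, the same hardware counter can be read from C through the Linux perf_event_open system call. This is not part of the original answer; the sketch below follows the pattern in the perf_event_open(2) man page, and the names workload and count_user_instructions are hypothetical. Setting exclude_kernel and exclude_hv is roughly what the :u modifier requests.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical workload used only for illustration. */
volatile unsigned long long sink;

void workload(void)
{
    unsigned long long total = 0;
    for (unsigned long long i = 0; i < 100000; i++)
        total += i;
    sink = total;   /* volatile store keeps the result observable */
}

/* Counts user-space instructions retired while fn() runs on the calling
   thread. Returns -1 if the counter cannot be opened or read (for example
   when perf_event_paranoid forbids it, or in a VM without PMU access). */
long long count_user_instructions(void (*fn)(void))
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;  /* do not count kernel-mode instructions */
    attr.exclude_hv = 1;      /* do not count hypervisor instructions */

    /* glibc provides no wrapper, so go through syscall(2).
       pid = 0, cpu = -1: this thread, on whichever CPU it runs. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = -1;
    if (read(fd, &count, sizeof count) != (ssize_t)sizeof count)
        count = -1;
    close(fd);
    return count;
}
```

Calling count_user_instructions(workload) from main returns the count for that one call. The result includes a few user-space instructions on either side of the enable and disable ioctl calls, so treat small differences as noise.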

The instruction count will still be a very big number even for a simple "hello world" program, like 180k if built normally, because it includes dynamic-linker startup and all the code that runs inside library functions. It also includes the CRT startup code that calls your main and then makes an exit system call with main's return value if you return instead of calling exit(3).

You might statically link your C program to reduce that startup overhead, by compiling with gcc -O2 -static -fno-stack-protector -fno-pie -no-pie.


perf counting instructions:u seems to be pretty accurate on my Skylake CPU. A statically-linked x86-64 binary that contains only 2 instructions gets 3 counts. Apparently there's one extra instruction being counted in the transition between kernel and user mode in one direction, but that's pretty minor.

$ cat > exit.asm <<EOF
global _start       ; hand-written asm to check perf overhead
_start:
    mov eax, 231     ; _NR_exit_group
    syscall          ; exit_group(EDI) (in practice zero)
EOF
$ nasm -felf64 exit.asm && ld -o exit  exit.o   # static executable, no CRT or libc
$ perf stat -e instructions:u ./exit

 Performance counter stats for './exit':

                 3      instructions:u                                              

       0.000651529 seconds time elapsed

# for this 2-instruction hand-written program

Using ld on its own is somewhat similar to linking with gcc -nostdlib -static (which also implies -no-pie; static-pie is a separate thing).


Minimal instruction count for a C program with CRT and libc: about 33k

A statically-linked binary made by the C compiler that calls puts twice counts 33,202 instructions:u. I compiled with gcc -O2 -static -fno-stack-protector -fno-pie -no-pie hello.c.

That seems reasonable for the glibc init functions (including stdio) and the CRT startup code that run before main is called. (main itself only has 8 instructions, which I checked with objdump -drwC -Mintel a.out | less.)

If main just exited without printing, or especially if it called _exit(0) or exit_group(0) (the raw system calls, bypassing atexit stuff), you'd have fewer instructions from not using stdio.


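Other resources:

  • Number of executed instructions different for Hello World program in NASM assembly and C

    @MichaelPetch's answer shows how to use an alternate libc (MUSL) that doesn't need startup code to run for its printf to work. So you can compile a C program and set its main as the ELF entry point (and call _exit() instead of returning).

  • The Q&A on profiling code running on Linux: there are tons of profiling tools for finding hotspots and expensive functions (including time spent in the functions they call, i.e. stack-backtrace profiling). Mostly this isn't about counting instructions, though.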


Binary instrumentation tools:

These are the heavy-duty tools for counting instructions, including counting only specific kinds of instructions.

  • Intel Pin - A Dynamic Binary Instrumentation Tool

  • Intel® Software Development Emulator (SDE): this is based on Pin, and is handy for things like testing AVX512 code on a dev machine that doesn't support AVX512. (It dynamically recompiles, so most instructions run natively, but unsupported instructions call an emulation routine.)

    For example, sde64 -mix -- ./my_program will print an instruction-mix for your program, with total counts for each different instruction, and breakdowns by categories. See libsvm compiled with AVX vs no AVX for an example of the kind of output.

    It also gives you a table of total dynamic instruction counts per function, as well as per thread and global. SDE mix output doesn't work well on PIE executables, though: it thinks the dynamic linker is the executable (because it is), so compile with gcc -O2 -no-pie -fno-pie prog.c -o prog. Even then, it doesn't show the puts calls or main itself in the profile output for a hello world test program, and I don't know why not.

  • Calculating "FLOP" using Intel® Software Development Emulator (Intel® SDE): an example of using SDE to count certain kinds of instructions, like vfmadd231pd.

    Intel CPUs have HW perf counters for events like fp_arith_inst_retired.256b_packed_double, so you can use those to count FLOPs instead. They actually count FMA as 2 events. So if you have an Intel CPU that can run your code natively, you can do that instead with perf stat -e fp_arith_inst_retired.256b_packed_double,fp_arith_inst_retired.128b_packed_double,fp_arith_inst_retired.scalar_double. (And/or the events for single-precision.)

    But there aren't events for most other specific kinds of instructions, only FP math.
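Turning those three double-precision event counts into a FLOP total is simple arithmetic, sketched below in C. The struct and function names are hypothetical, and the per-event weights are an assumption based on vector width: each event is taken to count one instruction (an FMA counting twice), so a 128-bit packed-double event stands for 2 operations and a 256-bit one for 4.

```c
#include <stdint.h>

/* Event counts as reported by perf stat for the double-precision events. */
struct fp_double_counts {
    uint64_t scalar_double;       /* fp_arith_inst_retired.scalar_double */
    uint64_t packed_128b_double;  /* fp_arith_inst_retired.128b_packed_double */
    uint64_t packed_256b_double;  /* fp_arith_inst_retired.256b_packed_double */
};

/* Weights each event by the number of double elements per vector:
   1 for scalar, 2 for 128-bit, 4 for 256-bit. Because the hardware already
   counts an FMA as two events, no extra factor is applied for FMA here. */
uint64_t double_flops(struct fp_double_counts c)
{
    return c.scalar_double * 1
         + c.packed_128b_double * 2
         + c.packed_256b_double * 4;
}
```

For example, counts of 10 scalar, 5 128-bit, and 3 256-bit events give 10 + 10 + 12 = 32 double-precision FLOPs under these assumptions.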

This is all Intel stuff; IDK what AMD has, or any stuff for ISAs other than x86. These are just the tools I've heard of; I'm sure there are lots of things I'm leaving out.
