How do I determine the number of x86 machine instructions executed in a C program?

Question

I'm currently working on a homework problem that asks me to find out the number of machine code instructions that are executed when running a short program I wrote in C.

The question says I am able to use whatever tools I want to figure it out, but I'm fairly new to C and have very little idea how to go about this.

What types of tools do I need to figure this out?

Recommended answer

Terminology: what you're asking for is the dynamic instruction count, i.e. an instruction inside a loop is counted every time it executes. This is usually roughly correlated with performance, but instructions-per-cycle can vary wildly.

  • How many CPU cycles are needed for each assembly instruction?
  • What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

Something people also look at is static instruction count (or more usually just code-size, because that's what really matters for instruction-cache footprint and disk-load times). For variable-length instruction sets like x86, those are correlated but not the same thing. On a RISC with fixed-length instructions, like MIPS or AArch64, it's closer, but you still have padding for alignment of the start of functions, for example. That's a totally separate metric. gcc -Os optimizes for code-size while trying not to sacrifice too much speed.

If you're on Linux, use gcc -O2 foo.c to compile your code. Before GCC 12, -O2 doesn't enable auto-vectorization for gcc (it does for clang); GCC 12 and later do vectorize at -O2, with a conservative cost model. It's probably a good baseline level of optimization that will get rid of stuff in your C code that doesn't actually need to happen, to avoid silly differences between using more or fewer tmp variables to break up a big expression. Maybe use -Og if you want minimal optimization, or -O0 if you want really dumb braindead code that compiles each statement separately and never keeps anything in registers between statements. (See: Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?)

Yes, it matters a huge amount how you compile. gcc -O3 -march=native -ffast-math might use a lot fewer instructions, if it auto-vectorizes a loop.

To stop your code from optimizing away, take an input from a command-line arg, or read it from a volatile variable. Like volatile int size_volatile = 1234; int size = size_volatile;. And return or print a result, because if the program has no side-effects then the most efficient implementation is to just exit immediately.

Then run perf stat ./a.out. That will use hardware performance counters to give you total instructions executed on behalf of your process, including inside the kernel. (Along with other counters, like CPU core clock cycles, and some software counters like page-faults and time in microseconds.)

To count only user-space instructions, use perf stat -e instructions:u ./a.out. That will still be a very big number even for a simple "hello world" program, like 180k, because it includes dynamic-linker startup and all the code that runs inside library functions. It also includes the CRT startup code that calls your main, and that makes an exit system call with main's return value if you return instead of calling exit(3).

You might statically link your C program to reduce that startup overhead, by compiling with gcc -O2 -static -fno-stack-protector -fno-pie -no-pie.

perf counting instructions:u seems to be pretty accurate on my Skylake CPU. A statically-linked x86-64 binary that contains only 2 instructions, mov eax, 231 / syscall, is counted as 3 instructions. Probably there's one extra instruction being counted in the transition between kernel and user mode, but that's pretty minor.

$ perf stat -e instructions:u ./exit    # hand-written in asm to check for perf overhead

 Performance counter stats for './exit':

                 3      instructions:u                                              

       0.000651529 seconds time elapsed

A statically-linked binary that calls puts twice counts 33,202 instructions:u, compiled with gcc -O2 -static -fno-stack-protector -fno-pie -no-pie hello.c. Seems reasonable for glibc init functions, including stdio, and CRT startup stuff before calling main. (main itself only has 8 instructions, which I checked with objdump -drwC -Mintel a.out | less).

@MichaelPetch's answer shows how to use an alternate libc (MUSL) that doesn't need startup code to run for its printf to work. So you can compile a C program and set its main as the ELF entry point (and call _exit() instead of returning).

How can I profile C++ code running on Linux? There are tons of profiling tools for finding hotspots, and expensive functions (including the time spent in functions they call, i.e. stack backtrace profiling). Mostly this isn't about counting instructions, though.

These are the heavy duty tools for counting instructions, including counting only specific kinds of instructions.

  • Intel Pin - A Dynamic Binary Instrumentation Tool
  • Intel® Software Development Emulator (SDE) This is based on PIN, and is handy for things like testing AVX512 code on a dev machine that doesn't support AVX512. (It dynamically recompiles so most instructions run natively, but unsupported instructions call an emulation routine.)

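For example, sde64 -mix -- ./my_program will print the instruction mix for your program, with total counts for each different instruction and breakdowns by category. See libsvm compiled with AVX vs. no AVX for an example of this kind of output.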

It also gives you a table of total dynamic instruction counts per-function, as well as per-thread and global. SDE mix output doesn't work well on PIE executables, though: it thinks the dynamic linker is the executable (because it is), so compile with gcc -O2 -no-pie -fno-pie prog.c -o prog. It still doesn't see the puts calls or main itself in the profile output for a hello world test program, though, and I don't know why not.

Calculating "FLOP" using Intel® Software Development Emulator (Intel® SDE) An example of using SDE to count certain kinds of instructions, like vfmadd231pd.

Intel CPUs have HW perf counters for events like fp_arith_inst_retired.256b_packed_double, so you can use those to count FLOPs instead. They actually count FMA as 2 events. So if you have an Intel CPU that can run your code natively, you can do that instead with perf stat -e fp_arith_inst_retired.256b_packed_double,fp_arith_inst_retired.128b_packed_double,fp_arith_inst_retired.scalar_double. (And/or the events for single-precision.)

But there aren't events for most other specific kinds of instructions, only FP math.

This is all Intel stuff; IDK what AMD has, or anything for ISAs other than x86. These are just the tools I've heard of; I'm sure there's lots I'm leaving out.
