How do I monitor the amount of SIMD instruction usage


Question

How can I monitor the amount of SIMD (SSE, AVX, AVX2, AVX-512) instruction usage of a process? For example, htop can be used to monitor general CPU usage, but not specifically SIMD instruction usage.

Solution

I think the only reliable way to count all SIMD instructions (not just FP math) is dynamic instrumentation (e.g. via something like Intel PIN / SDE).

See "How to characterize a workload by obtaining the instruction type breakdown?" and "How do I determine the number of x86 machine instructions executed in a C program?". Specifically, sde64 -mix -- ./my_program prints the instruction mix for your program for that run; there is example output in "libsvm compiled with AVX vs no AVX".

I don't think there's a good way to make this like top / htop, if it's even possible to safely attach to already-running processes, especially multi-threaded ones.

It might also be possible to get dynamic instruction counts using last-branch-record stuff to record / reconstruct the path of execution and count everything, but I don't know of tools for that. In theory that could attach to already-running programs without much danger, but it would take a lot of computation (disassembling and counting instructions) to do it on the fly for all running processes. That's not like just asking the kernel for CPU usage stats that it tracks anyway on context switches.

You'd need hardware instruction-counting support for this to be really efficient the way top is.


For SIMD floating point math specifically (not FP shuffles, just real FP math like vaddps), there are perf counter events.

e.g. from perf list output:

fp_arith_inst_retired.128b_packed_single
[Number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired. Each count represents 4 computations. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP RSQRT SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element]

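So it's not even counting uops, it's effectively counting FLOPS: multiply each event's count by the number of elements per vector (4 for this one) to get FP operations. There are other events for ...pd packed double, and 256-bit versions of each. (I assume on CPUs with AVX512, there are also 512-bit vector versions of these events.)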
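To make that arithmetic concrete, here is a minimal sketch of turning raw event counts into a packed-SIMD FLOP estimate. The function name, the table, and the sample numbers are hypothetical (they are not part of perf); the weights are just vector width divided by element width, in line with the "each count represents 4 computations" wording in the event description. FMA needs no extra factor, since the description says the hardware already counts it twice.

```python
# Elements per vector for each packed FP event: vector width / element width.
ELEMENTS_PER_COUNT = {
    "fp_arith_inst_retired.128b_packed_single": 4,  # 128 / 32
    "fp_arith_inst_retired.128b_packed_double": 2,  # 128 / 64
    "fp_arith_inst_retired.256b_packed_single": 8,  # 256 / 32
    "fp_arith_inst_retired.256b_packed_double": 4,  # 256 / 64
}


def estimate_simd_flops(counts):
    """Estimate packed-SIMD FP operations from raw event counts.

    `counts` maps event name -> count. A modifier suffix such as ":u" is
    stripped before lookup; events missing from the table (cycles,
    instructions, ...) are ignored.
    """
    total = 0
    for event, count in counts.items():
        base_name = event.split(":", 1)[0]
        if base_name in ELEMENTS_PER_COUNT:
            total += count * ELEMENTS_PER_COUNT[base_name]
    return total


if __name__ == "__main__":
    # Made-up counts, purely to show the arithmetic.
    sample = {
        "cycles:u": 1_000_000,
        "fp_arith_inst_retired.128b_packed_single:u": 100,
        "fp_arith_inst_retired.256b_packed_double:u": 50,
    }
    print(estimate_simd_flops(sample))  # 100*4 + 50*4 = 600
```

If your CPU's perf list shows 512-bit variants, they could presumably be added to the table the same way (16 for single, 8 for double), but check the event descriptions first.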

You can use perf to count their execution globally across processes and on all cores. Or for a single process:

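## count math instructions only, not SIMD integer, load/store, or anything else
perf stat -e cycles:u,instructions:u,fp_arith_inst_retired.128b_packed_double:u,fp_arith_inst_retired.128b_packed_single:u,fp_arith_inst_retired.256b_packed_double:u,fp_arith_inst_retired.256b_packed_single:u ./my_program
# events are written out in full: shell brace expansion ({128,256}b_packed_{double,single}) separates its results with spaces, not the commas -e needs.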

(Intentionally omitting fp_arith_inst_retired.scalar_{double,single} because you only asked about SIMD, and scalar instructions on XMM registers don't count, IMO.)
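If typing the event names out by hand gets tedious, one option is to generate the comma-separated list in a script instead of relying on shell brace expansion. This is only a sketch: the helper names packed_fp_events and perf_stat_argv are made up for illustration, and it only builds the argument list (it doesn't run perf).

```python
import shlex


def packed_fp_events(widths=(128, 256), precisions=("double", "single"), modifier="u"):
    """Return the packed FP event names, e.g. for 128/256-bit single/double."""
    return [
        f"fp_arith_inst_retired.{width}b_packed_{precision}:{modifier}"
        for width in widths
        for precision in precisions
    ]


def perf_stat_argv(command, extra_events=("cycles:u", "instructions:u")):
    """Build a `perf stat` argument list that counts the events for `command`."""
    events = list(extra_events) + packed_fp_events()
    # perf's -e option wants one comma-separated string with no spaces.
    return ["perf", "stat", "-e", ",".join(events), "--", *command]


if __name__ == "__main__":
    # Print the command line so it can be inspected or pasted into a shell.
    print(shlex.join(perf_stat_argv(["./my_program"])))
```

The resulting list could be handed to subprocess.run on a machine where perf is installed and these events show up in perf list.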

You can attach perf to a running process by using -p PID instead of a command. Or use perf top, as suggested in "Ubuntu - how to tell if AVX or SSE, is current being used by CPU app?".

You can run perf stat -a to monitor globally across all cores, regardless of what process is executing. But again, this only counts FP math, not SIMD in general.

Still, it is hardware-supported and thus could be cheap enough for something like htop to use without wasting a lot of CPU time if you leave it running long-term.
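As a rough idea of what an htop-like consumer might do with those counters, the sketch below parses interval output of the kind produced by something like perf stat -a -I 1000 -x, -e <events> and sums an estimated packed-SIMD FLOP count per interval. The function names are hypothetical, and the assumed CSV layout (timestamp, count, unit, event name as the first four fields) is my reading of the perf-stat CSV format, so check it against your perf version.

```python
import re
from collections import defaultdict

# Matches e.g. "fp_arith_inst_retired.256b_packed_double" (optionally with ":u").
_PACKED_FP = re.compile(
    r"^fp_arith_inst_retired\.(\d+)b_packed_(single|double)(?::\w+)?$"
)


def elements_per_count(event):
    """Return vector elements per count for a packed FP event, else None."""
    match = _PACKED_FP.match(event)
    if match is None:
        return None
    width_bits = int(match.group(1))
    element_bits = 32 if match.group(2) == "single" else 64
    return width_bits // element_bits


def flops_per_interval(lines):
    """Sum estimated packed-SIMD FLOPs per interval timestamp.

    Assumes each line starts with: timestamp,count,unit,event-name
    Lines for other events, comment lines, and lines whose count is not an
    integer (e.g. "<not counted>") are skipped.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.strip().split(",")
        if line.startswith("#") or len(fields) < 4:
            continue
        timestamp, value, _unit, event = fields[:4]
        weight = elements_per_count(event)
        if weight is None:
            continue
        try:
            when, count = float(timestamp), int(value)
        except ValueError:
            continue
        totals[when] += count * weight
    return dict(totals)
```

Note that perf stat writes its statistics to stderr by default, so a real monitor would have to read that stream (or redirect it to a file) and feed the lines to a parser like this one.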
