How to profile the number of additions, multiplications, etc. with VTune

Question

I am able to profile my C++ library's instruction counts with VTune using the 'INST_RETIRED.ANY' event.

Which analysis types or events can be used to profile the number of integer/floating-point additions, multiplications, divisions, etc.?

Recommended answer

(tl;dr): I don't think you can do everything you want with perf counters. See the end of this answer for a possible approach using binary instrumentation.

Also note that imul is not an expensive operation, and FP mul is barely more expensive than add. E.g. on Skylake, mulps, addps, and fma all have the same performance (throughput, latency, uops, and choice of execution ports). Before Skylake, add had lower latency but also half the throughput, since it ran on a dedicated add unit.

It's not so much what VTune can do as what the hardware performance counters can count. E.g. this table of perf-counter events from Linux oprofile came up when I searched for Sandybridge perf counters. There is also this more complete listing for Linux perf. If the hardware can count it, I assume VTune can show it to you once you find the right name.

Test these counters on simple code with known behaviour, to make sure they work the way you expect when you already know what the code is doing.
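As a minimal sketch of such a test (the name run_known_kernel and the KnownCounts struct are hypothetical, not from any library), the following C++ kernel performs a known number of scalar double multiplications and additions at the source level. It assumes the compiler neither vectorizes the loop nor contracts the multiply and add into an FMA; that depends on your flags, so check the disassembly before trusting the expected counts.

```cpp
#include <cassert>
#include <cstdint>

// Expected source-level operation counts plus the computed value.
// Returning the value keeps the loop from being optimized away.
struct KnownCounts {
    std::uint64_t fp_mul;  // multiplications written in the loop
    std::uint64_t fp_add;  // additions written in the loop
    double result;
};

// Hypothetical test kernel: n iterations, each with one FP multiply
// and one FP add on a loop-carried dependency chain.
KnownCounts run_known_kernel(std::uint64_t n) {
    assert(n > 0 && "a zero-iteration run gives nothing to measure");
    // volatile forces a load every iteration, so the compiler cannot
    // constant-fold the multiplications.
    volatile double scale = 1.0000001;
    double x = 1.0;
    double acc = 0.0;
    for (std::uint64_t i = 0; i < n; ++i) {
        x = x * scale;  // one FP multiply
        acc = acc + x;  // one FP add
    }
    return {n, n, acc};
}
```

Calling this from a small driver with a large n and collecting the FP events over that run lets you compare the reported counts against fp_mul and fp_add. A large mismatch suggests that the event counts something other than what its name implies, or that the compiler transformed the loop.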

I only looked through what Sandybridge supports. I assume Haswell/Skylake have these events too, and probably more. You didn't say what CPU you have, so I'm not going to check all of them.

Pre-SnB CPUs don't have nearly as wide a selection of perf counters, IIRC. Intel improved perf counters a lot in SnB, along with other big changes to the core. The changes were big enough that SnB is generally considered a new microarchitecture family, separate from the P6 family (PPro through Nehalem).

I don't think you can distinguish integer add from integer mul, or FP add from FP mul. You can count FP activity, though: FP_COMP_OPS_EXE "Counts number of floating point events", with masks for x87 and {packed,scalar}{single,double}.

There's also SIMD_FP_256, which counts only 256b vector FP ops.

There's a counter for FP-assist events (when an FP operation needs to fall back to microcode to handle a denormal or something similar).

I'm not sure this is right, but the perf listing says there's a PARTIAL_RAT_STALLS with Umask-02 : 0x80: [MUL_SINGLE_UOP]: Number of Multiply packed/scalar single precision uops allocated. It's odd that there's no similar double-precision counter. Or maybe mulss is somehow special in its partial-register behaviour, since PARTIAL_RAT_STALLS has another sub-event that counts partial-register merging uops.

Divide (div / divps) is slow enough to be worth having a special counter, though: SnB's arith.fpu_div counter = "Number of times that the divider is actived, includes INT, SIMD and FP." There's also a counter for the number of cycles the divider is active, rather than the number of times it was activated.

Intel's Pin is a dynamic binary instrumentation framework for the IA-32 and x86-64 instruction-set architectures that enables the creation of dynamic program analysis tools.

I don't have VTune, but there may be ways to use Pin tools from within VTune. It will make your code run slower, potentially a lot slower. I think it works by JIT-compiling normal machine code into instrumented machine code, where the instrumentation is extra instructions that increment counters. It might have other modes of operation, more like single-stepping the original code and counting things along the way.
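To illustrate the "extra instructions that increment counters" idea without relying on Pin's API, here is a source-level analogue in C++. This is not how Pin works internally and it is not a Pin tool; it is a swapped-in technique that only applies when the library is templated on its numeric type (an assumption). The names CountedDouble, OpCounts, op_counts, and dot are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Global tallies, bumped by the overloaded operators below.
struct OpCounts {
    std::uint64_t add = 0;
    std::uint64_t mul = 0;
    std::uint64_t div = 0;
};

inline OpCounts& op_counts() {
    static OpCounts counts;
    return counts;
}

// Drop-in wrapper around double whose arithmetic operators count themselves.
struct CountedDouble {
    double v;
    CountedDouble(double x = 0.0) : v(x) {}
};

inline CountedDouble operator+(CountedDouble a, CountedDouble b) {
    ++op_counts().add;
    return CountedDouble(a.v + b.v);
}
inline CountedDouble operator*(CountedDouble a, CountedDouble b) {
    ++op_counts().mul;
    return CountedDouble(a.v * b.v);
}
inline CountedDouble operator/(CountedDouble a, CountedDouble b) {
    ++op_counts().div;
    return CountedDouble(a.v / b.v);
}

// Example numeric routine, generic over the scalar type so it can run
// with plain double in production and CountedDouble when counting.
template <typename T>
T dot(const T* a, const T* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    T acc = T(0.0);
    for (std::size_t i = 0; i < n; ++i) {
        acc = acc + a[i] * b[i];  // one multiply and one add per element
    }
    return acc;
}
```

Instantiating dot with CountedDouble and reading op_counts() afterwards gives source-level operation counts. These can differ from the machine instructions actually executed (the compiler may vectorize or fuse operations in the plain-double build), which is the gap that binary instrumentation such as Pin closes.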
