How far should one trust hardware counter profiling using VsPerfCmd.exe?


Problem description

I'm attempting to use VsPerfCmd.exe to profile branch misprediction and last level cache misses in an instrumented native application.

The setup works as it says on the tin, but the results I'm getting don't seem sensible. For instance, a function that always touches a 24 MB data set is reported to cause only ~700 cache misses when called ~2000 times. To put this into perspective: the function linearly traverses two arrays of 1024*1024 elements, each element 12 bytes in size. For every element, it randomly decides whether it needs information from the element 1024 indices before or after it. That means that, in order not to generate any cache misses, the CPU would always have to keep at least three sections of 1024*12 bytes of each of these arrays in cache. Furthermore, after every iteration the process yields the CPU using sleep() for about 8 milliseconds. I can't imagine any hardware prefetcher doing that good a job.

How would this silly amount of data not generate more last level cache misses than VsPerfCmd says? Even though my i7 has 8MB of shared L3 cache, this seems highly unlikely. Can anyone share their opinions on what might be going on here? Of course "VsPerfCmd.exe sucks" would be a valid answer but if someone is going to say that, I'd like to at least hear of a similar experience someone had as a basis for this assertion.
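To make the sizes above concrete, here is a rough back-of-envelope sketch. It assumes 64-byte cache lines and tightly packed elements, neither of which is stated in the question, and it only counts how much data is involved; it does not predict what any particular counter event should report.

```python
# Back-of-envelope sizes for the data set described in the question.
# Assumptions (not from the post): 64-byte cache lines, no padding between elements.
ELEMENT_BYTES = 12
ELEMENTS_PER_ARRAY = 1024 * 1024
ARRAY_COUNT = 2
CACHE_LINE_BYTES = 64            # assumption
L3_BYTES = 8 * 1024 * 1024       # shared L3 size given in the question

# Total data touched by one call of the function.
working_set_bytes = ARRAY_COUNT * ELEMENTS_PER_ARRAY * ELEMENT_BYTES
lines_per_call = working_set_bytes // CACHE_LINE_BYTES

# Part of the data set that cannot be resident in L3 at the same time,
# even if the whole L3 held nothing but this data.
bytes_beyond_l3 = working_set_bytes - L3_BYTES
lines_beyond_l3 = bytes_beyond_l3 // CACHE_LINE_BYTES

# Distance, in bytes, of the look-back / look-ahead of 1024 elements.
neighbor_offset_bytes = 1024 * ELEMENT_BYTES

print(f"working set: {working_set_bytes / 2**20:.0f} MiB, {lines_per_call} cache lines")
print(f"beyond L3:   {bytes_beyond_l3 / 2**20:.0f} MiB, {lines_beyond_l3} cache lines")
print(f"neighbor offset: {neighbor_offset_bytes} bytes")
```

Under these assumptions a single call touches 393,216 distinct cache lines, of which 262,144 cannot fit in the 8 MB L3 alongside the rest, which is why ~700 misses across ~2000 calls looks implausible.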

解决方案

Answering my own question: after trying to verify the VsPerfCmd results using Intel VTune Amplifier XE™ (this is no advertising, I just like typing out product names like that because it amuses me how silly they can be), I can definitely say that the VsPerfCmd results are garbage.

This is only a rough comparison, as I haven't found out how to get the number of times a function was called from VTune, but approximately 900 calls resulted in 1,040,000 last level cache misses, according to VTune. Contrasting that with the ~2000 calls profiled with VsPerfCmd and the reported ~700 LLC misses, it's safe to assume that the VTune results are much more reasonable.
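Normalizing both measurements to misses per call makes the gap easier to see. This is a minimal sketch using only the approximate figures quoted above; the variable names are illustrative, and the call counts are rough estimates, so the result is an order-of-magnitude comparison at best.

```python
# Approximate figures quoted in the answer; both call counts are rough.
vtune_calls = 900
vtune_llc_misses = 1_040_000
vsperf_calls = 2000
vsperf_llc_misses = 700

# Last level cache misses attributed to a single call by each tool.
vtune_per_call = vtune_llc_misses / vtune_calls
vsperf_per_call = vsperf_llc_misses / vsperf_calls

# How many times larger the VTune per-call figure is.
ratio = vtune_per_call / vsperf_per_call

print(f"VTune:     {vtune_per_call:.0f} misses per call")
print(f"VsPerfCmd: {vsperf_per_call:.2f} misses per call")
print(f"ratio:     about {ratio:.0f}x")
```

The two tools disagree by more than three orders of magnitude per call, far more than the uncertainty in the call counts could explain.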

Of course I can't say anything more specific than "VsPerfCmd was very likely wrong". The whys and hows of this phenomenon remain unclear. Should anyone who knows more feel an urge to elaborate on this, shoot me a comment!
