Reliability of Xcode Instrument's disassembly time profiling

Views: 73 | Published: 2020/11/29 18:33:04 | Tags: xcode x86 profiling instruments intel-pmu

Question

I've profiled my code using the Time Profiler in Instruments, and zooming in to the disassembly, here's a snippet of its results:

I wouldn't expect a mov instruction to take 23.3% of the time while a div instruction takes virtually nothing. This leads me to believe these results are unreliable. Is this true and known? Or am I just experiencing an Instruments bug? Or is there some option I need to use to obtain reliable results?

Is there any reference expanding on this issue?

Recommended answer

First of all, it's possible that some counts that really belong to divss are being charged to later instructions, which is called a "skid". (Also see the rest of that comment thread for some more details.) Presumably Xcode is like Linux perf, and uses the fixed cpu_clk_unhalted.thread counter for cycles instead of one of the programmable counters. This is not a "precise" event (PEBS), so skids are possible. As @BeeOnRope points out, you can use a PEBS event that ticks once per cycle (like UOPS_RETIRED < 16) as a PEBS substitute for the fixed cycles counter, removing some of the dependence on interrupt behaviour.
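As a rough illustration of what skid does to a profile, here's a toy Python sketch. It assumes every sample lands a fixed number of instructions after the one that "deserves" it, which is a simplification (real skid varies from sample to sample). The loop body and the sample counts are made up for the example.

```python
def apply_skid(loop_body, true_counts, skid=1):
    """Charge each instruction's samples to the instruction `skid` slots later.

    loop_body   -- instruction names in program order (assumed unique)
    true_counts -- hypothetical samples each instruction "deserves"
    skid        -- assumed fixed skid distance, in instructions
    """
    n = len(loop_body)
    observed = {name: 0 for name in loop_body}
    for i, name in enumerate(loop_body):
        # Wrap around because the last instruction is followed by the loop top.
        victim = loop_body[(i + skid) % n]
        observed[victim] += true_counts[name]
    return observed


# Hypothetical loop and counts, chosen only to show the effect.
loop_body = ["movaps", "divss", "addss", "mulss"]
true_counts = {"movaps": 5, "divss": 90, "addss": 3, "mulss": 2}
observed = apply_skid(loop_body, true_counts, skid=1)
print(observed)
```

With a skid of one instruction, the 90 samples that belong to divss show up on addss instead, and divss only shows the 5 that belonged to movaps.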

But the way counters fundamentally work for pipelined / out-of-order execution also explains most of what you're seeing. Or it might; you didn't show the complete loop so we can't simulate the code on a simple pipeline model like IACA does, or by hand using hardware guides like http://agner.org/optimize/ and Intel's optimization manual. (And you haven't even specified what microarchitecture you have. I guess it's some member of Intel Sandybridge-family on a Mac).

Counts for cycles are typically charged to the instruction that's waiting for the result, not usually the instruction that's slow to produce the result. Pipelined CPUs don't stall until you try to read a result that isn't ready yet.

Out-of-order execution massively complicates this, but it's still generally true when there's one really slow instruction, like a load that often misses in cache. When the cycles counter overflows (triggering an interrupt), there are many instructions in flight, but only one can be the RIP associated with that performance-counter event. It's also the RIP where execution will resume after the interrupt.

So what happens when an interrupt is raised? See Andy Glew's answer about that, which explains the internals of perf-counter interrupts in the Intel P6 microarchitecture's pipeline, and why (before PEBS) they were always delayed. Sandybridge-family is similar to P6 for this.

I think a reasonable mental model for perf-counter interrupts on Intel CPUs is that it discards any uops that haven't yet been dispatched to an execution unit. But ALU uops that have been dispatched already go through the pipeline to retirement (if there aren't any younger uops that got discarded) instead of being aborted, which makes sense because the maximum extra latency is ~16 cycles for sqrtpd, and flushing the store queue can easily take longer than that. (Pending stores that have already retired can't be rolled back). IDK about loads/stores that haven't retired; at least the loads are probably discarded.

I'm basing this guess on the fact that it's easy to construct loops that don't show any counts for divss even when the CPU is sometimes waiting for it to produce its output. If it was discarded without retiring, it would be the next instruction when resuming after the interrupt, so (other than skids) you'd see lots of counts for it.

Thus, the distribution of cycles counts shows you which instructions spend the most time being the oldest not-yet-dispatched instruction in the scheduler. (Or in case of front-end stalls, which instructions the CPU is stalled trying to fetch / decode / issue). Remember, this usually means it shows you the instructions that are waiting for inputs, not the instructions that are slow to produce them.
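To make that attribution rule concrete, here's a toy sketch in Python. It assumes a strictly in-order machine that dispatches at most one instruction per cycle and charges every cycle to the oldest instruction that hasn't dispatched yet; a real out-of-order core is far more complicated. The register names and latencies (11 cycles for divss, 3 for addss, 4 for mulss) are illustrative assumptions, not measurements.

```python
from collections import Counter


def charge_cycles(program):
    """Toy in-order model of where `cycles` samples would land.

    program -- list of (name, dest_reg, source_regs, latency) tuples
    Every cycle is charged to the oldest instruction not yet dispatched.
    """
    ready_at = {}      # register -> cycle at which its value becomes available
    counts = Counter()
    cycle = 0
    for name, dest, srcs, latency in program:
        # The instruction can dispatch once all of its inputs are ready.
        start = max([cycle] + [ready_at.get(reg, 0) for reg in srcs])
        # It was the oldest undispatched instruction from `cycle` through `start`.
        counts[name] += start - cycle + 1
        ready_at[dest] = start + latency
        cycle = start + 1
    return counts


program = [
    ("divss", "x1", ["x0"], 11),  # slow producer, inputs already ready
    ("addss", "x2", ["x1"], 3),   # consumer: has to wait for divss
    ("mulss", "x3", ["x4"], 4),   # independent of the divss result
]
counts = charge_cycles(program)
print(dict(counts))  # {'divss': 1, 'addss': 11, 'mulss': 1}
```

Even in this crude model, the slow divss gets a single count while the addss that waits for its result gets almost all of them.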

(Hmm, this might not be right, and I haven't tested this much. I usually use perf stat to look at overall counts for a whole loop in a microbenchmark, not statistical profiles with perf record. addss and mulss are higher latency than andps, so you'd expect andps to get counts waiting for its xmm5 input if my proposed model was right.)

Anyway, the general problem is, with multiple instructions in flight at once, which one does the HW "blame" when the cycles counter wraps around?

Note that divss is slow to produce the result, but is only a single-uop instruction (unlike integer div which is microcoded on AMD and Intel). If you don't bottleneck on its latency or its not-fully-pipelined throughput, it's not slower than mulss because it can overlap with surrounding code just as well.

(divss / divps is not fully pipelined. On Haswell for example, an independent divps can start every 7 cycles. But each only takes 10-13 cycles to produce its result. All other execution units are fully pipelined; able to start a new operation on independent data every cycle.)
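A back-of-the-envelope sketch of what those Haswell numbers mean, in Python. It uses the 7-cycle issue interval quoted above and takes 13 cycles (the top of the 10-13 range) as the latency; it ignores everything else going on in the core, so treat the outputs as rough lower bounds rather than predictions.

```python
def cycles_independent(n, issue_interval=7, latency=13):
    """n independent divps: a new one starts every `issue_interval` cycles,
    and the last one finishes `latency` cycles after it starts."""
    return (n - 1) * issue_interval + latency


def cycles_dependent(n, latency=13):
    """n divps in one dependency chain: each waits for the previous result."""
    return n * latency


print(cycles_independent(10))  # 76
print(cycles_dependent(10))    # 130
```

So ten independent divides overlap and cost roughly 7 cycles each, while ten chained divides pay the full latency every time.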

Consider a large loop that bottlenecks on throughput, not latency of any loop-carried dependency, and only needs divss to run once per 20 FP instructions. Using divss by a constant instead of mulss with the reciprocal constant should make (nearly) no difference in performance. (In practice out-of-order scheduling isn't perfect, and longer dependency chains hurt some even when not loop-carried, because they require more instructions to be in flight to hide all that latency and sustain max throughput. i.e. for the out-of-order core to find the instruction-level parallelism.)

Anyway, the point here is that divss is a single uop and it makes sense for it not to get many counts for the cycles event, depending on the surrounding code.

You see the same effect with a cache-miss load: the load itself mostly only gets counts if it has to wait for the registers in the addressing mode, and the first instruction in the dependency chain that uses the loaded data gets a lot of counts.

Your profile results probably tell us:

The divss isn't having to wait for its inputs to be ready. (The movaps %xmm3, %xmm5 before the divss sometimes takes some cycles, but the divss never does.)

We may be close to a divss bottleneck.

The dependency chain involving xmm5 after divss is getting some counts. Out-of-order execution has to work to keep multiple independent iterations of that in flight at once.

The maxss / movaps loop-carried dependency chain may be a significant bottleneck. (Especially if you're on Skylake where divss throughput is one per 3 clocks, but maxss latency is 4 cycles. And resource conflicts from competition for ports 0 and 1 will delay maxss.)
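The Skylake comparison in that last point is just a max() of two lower bounds, which can be sketched in Python like this. The function and its parameter names are made up for illustration; it only uses the two numbers quoted above (one divss per 3 clocks, 4-cycle maxss latency), assumes one divss per iteration, and ignores port conflicts and everything else in the loop.

```python
def min_cycles_per_iteration(divss_per_iter=1, divss_interval=3,
                             chain_latencies=(4,)):
    """Lower bound on cycles per loop iteration from two limits:
    divider throughput and the loop-carried dependency chain latency."""
    divider_bound = divss_per_iter * divss_interval
    latency_bound = sum(chain_latencies)
    if latency_bound >= divider_bound:
        return latency_bound, "latency chain"
    return divider_bound, "divider throughput"


# maxss alone on the loop-carried chain (movaps assumed eliminated, 0 cycles).
print(min_cycles_per_iteration())                        # (4, 'latency chain')
# If the movaps copy weren't eliminated and cost 1 cycle on the chain:
print(min_cycles_per_iteration(chain_latencies=(4, 1)))  # (5, 'latency chain')
```

Under these assumptions the maxss chain, not the divider, sets the floor on iteration time.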

The high counts for movaps might be due to it following maxss, forming the only loop-carried dependency in the part of the loop you show. So it's plausible that maxss really is slow to produce results. But if it really was a loop-carried dep chain that was the major bottleneck, you'd expect to see lots of counts on maxss itself, as it would be waiting for its input from the last iteration.

But maybe mov-elimination is "special", and all the counts for some reason get charged to movaps? On Ivybridge and later CPUs, register copies don't need an execution unit, but instead are handled in the issue/rename stage of the pipeline.
