Is cycle count itself reliable on program timing?


Question


I am currently trying to develop a judging system that measures not only time and memory use but also deeper information such as cache misses, and I assume hardware counters (using perf) are perfect for that.

But for the timing part, I wonder whether using purely the cycle count to determine execution speed is reliable enough. I would like to know the pros and cons of this decision.

Solution

So you're proposing measuring CPU cycles, instead of seconds? Sounds somewhat reasonable.

For some microbenchmarks that's good, and mostly factors out the variations due to CPU frequency changes. (And delays due to interrupts if you count only user-space cycles, if you're microbenching a loop that doesn't make system calls. Only the secondary effects of interrupts are then visible, i.e. serializing the pipeline and perhaps evicting some of your data from cache / TLB.)
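For instance, here is a minimal sketch of how a judge might collect that counter (assuming Linux perf is installed; `./solution` is a placeholder for the submitted binary):

```python
import subprocess

def user_cycles(cmd):
    """Count user-space-only core clock cycles for cmd via perf.

    The :u modifier restricts the counter to user mode, so cycles spent
    in interrupt and syscall handlers are excluded; only their secondary
    effects (pipeline serialization, cache/TLB eviction) remain visible.
    -x, makes perf emit CSV, which it writes to stderr.
    """
    res = subprocess.run(["perf", "stat", "-x,", "-e", "cycles:u", *cmd],
                         capture_output=True, text=True)
    for line in res.stderr.splitlines():
        fields = line.split(",")  # value, unit, event name, ...
        if len(fields) > 2 and fields[2] == "cycles:u":
            return int(fields[0])
    raise RuntimeError("cycles:u not found in perf output")

# "./solution" is a hypothetical submission binary:
print(user_cycles(["./solution"]))
```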

But the memory (and maybe L3 cache) stay at constant speed while CPU frequency changes, so the relative cost of a cache miss changes: The same response time in nanoseconds is fewer core clock cycles, so out-of-order exec can hide more of it more easily. And available memory bandwidth is higher relative to what a core can use. So HW prefetch has an easier time keeping up.

e.g. at 4.3GHz, a load that misses in L2 cache but hits in L3 on Skylake-server might have a total latency of about 79 core clock cycles. (https://www.7-cpu.com/cpu/Skylake_X.html - i7-7820X (Skylake X), 8 cores).

At 800MHz idle clock speed, an L2 cache miss is still 14 cycles (because it runs at core speed). But if another core is keeping the L3 cache (and the uncore in general) at high clock speed, the off-core part of that round-trip request will take many fewer core clock cycles.

e.g. we can make a back-of-the-envelope calculation by assuming that all the extra time for an L3 hit vs. an L2 hit is spent in the uncore, not the core, and takes a fixed number of nanoseconds. Since we have that time in cycles of a 4.3GHz clock, the math works out as 14 + (79-14)*8/43 cycles for an L3 hit at 800MHz = 26 cycles, down from 79.

This rough calculation actually matches up with the 7-cpu.com numbers for the same CPU with a core at 3.6GHz: L3 Cache Latency = 68 cycles. 14 + (79-14)*36/43 = 68.4.
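Spelling the same arithmetic out as a tiny script (all numbers are the ones quoted above):

```python
# Back-of-the-envelope scaling of L3-hit latency with core clock.
# Assumption from the text: the extra latency of an L3 hit over an L2 hit
# (79 - 14 = 65 cycles measured at 4.3GHz) is uncore time, fixed in
# nanoseconds, while the 14-cycle L2 portion scales with the core clock.

L2_HIT_CYCLES = 14             # constant in core cycles at any frequency
L3_HIT_CYCLES_AT_4_3GHZ = 79   # total L3-hit latency measured at 4.3GHz

def l3_hit_cycles(core_ghz, ref_ghz=4.3):
    uncore_cycles_at_ref = L3_HIT_CYCLES_AT_4_3GHZ - L2_HIT_CYCLES  # 65
    # A fixed number of nanoseconds in the uncore costs proportionally
    # fewer core cycles when the core clock is slower:
    return L2_HIT_CYCLES + uncore_cycles_at_ref * core_ghz / ref_ghz

print(l3_hit_cycles(0.8))  # ~26.1 cycles at the 800MHz idle clock
print(l3_hit_cycles(3.6))  # ~68.4 cycles, matching 7-cpu.com's 68
```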

Note that I picked a "server" part because different cores can run at different clock speeds. That's not the case in "client" CPUs like i7-6700k. Uncore (L3, interconnect, etc.) may still be able to vary independently of the cores, e.g. staying high for the GPU. Also, server parts have higher latency outside the core. (e.g. 4GHz Skylake i7-6700k with turbo disabled has L3 latency of only 42 core clock cycles, not 68 or 79.)

See also Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? for why/how L3 and memory latency affect max possible single-core memory bandwidth.


Of course, if you control the CPU frequency by allowing some warm-up, or for tasks that run for more than a trivial amount of time, this isn't a big deal.

(Although do note that Skylake will sometimes lower the clock speed when very memory-bound, which unfortunately hurts bandwidth even more, at the default energy_performance_preference = balance_power, but "balance_performance" or "performance" can avoid that. Slowing down CPU Frequency by imposing memory stress)
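As a quick sanity check before benchmarking, one could read the current preference from sysfs (a sketch assuming a Linux machine using the intel_pstate driver; the path is the kernel's standard cpufreq sysfs interface):

```python
from pathlib import Path

# Print energy_performance_preference for every core. "balance_power"
# can let a very memory-bound Skylake clock down; "balance_performance"
# or "performance" avoid that, per the note above.
for f in sorted(Path("/sys/devices/system/cpu").glob(
        "cpu[0-9]*/cpufreq/energy_performance_preference")):
    print(f.parts[-3], f.read_text().strip())
```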

Do note that counting only cycles won't remove the cost of context switches (extra cache misses after switching back to this thread, and draining the ROB sucks). Or of competition from other cores for memory bandwidth.

e.g. another thread running on the other logical core of the same physical core will often seriously reduce IPC. Overall throughput usually goes up some, depending on the task, but individual per-thread throughput goes down.

Skylake has a perf event for tracking hyperthreading competition: cpu_clk_thread_unhalted.one_thread_active - IIRC that event count increments at something like 24MHz when your task is running and has the core all to itself. So if you see less than that, you know you had some competition and spent some time with the ROB partitioned and trading front-end cycles with another thread.
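A sketch of turning that event into a competition check (the event name is the one above; the ~24MHz rate is stated from memory, so treat the threshold as approximate):

```python
import subprocess

EXPECTED_HZ = 24e6  # approximate tick rate when the core is all yours

def one_thread_ratio(cmd):
    """Ratio of one_thread_active ticks to task-clock time, vs ~24MHz.
    Values well below 1.0 suggest a sibling hyperthread was competing
    (partitioned ROB, front-end cycles traded with the other thread)."""
    res = subprocess.run(
        ["perf", "stat", "-x,", "-e",
         "cpu_clk_thread_unhalted.one_thread_active,task-clock", *cmd],
        capture_output=True, text=True)
    counts = {}
    for line in res.stderr.splitlines():
        fields = line.split(",")  # value, unit, event name, ...
        if len(fields) > 2:
            try:
                counts[fields[2]] = float(fields[0])
            except ValueError:
                pass  # skip "<not counted>" and malformed lines
    seconds = counts["task-clock"] / 1000.0  # perf reports msec
    rate = counts["cpu_clk_thread_unhalted.one_thread_active"] / seconds
    return rate / EXPECTED_HZ

# "./solution" is a hypothetical submission binary:
print(one_thread_ratio(["./solution"]))  # ~1.0 => no HT competition
```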


So there are a bunch of effects, and it's up to you to decide whether it's useful. Sorting results by core clock cycles sounds reasonable, but you should probably include CPU-seconds (task-clock) and average-frequency in the results to help people spot outliers / glitches.
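As an illustration, a hypothetical sketch of a result row with those sanity-check fields (all numbers made up):

```python
def with_avg_freq(run):
    # Average frequency = cycles / seconds; a value far below the
    # machine's nominal clock flags a throttled or contended run.
    return {**run, "avg_ghz": run["cycles"] / run["task_clock_sec"] / 1e9}

# Hypothetical perf results for two submissions:
runs = [
    {"name": "A", "cycles": 4_100_000_000, "task_clock_sec": 1.02},
    {"name": "B", "cycles": 3_900_000_000, "task_clock_sec": 2.60},
]
# Rank by the primary metric (core clock cycles), but report CPU-seconds
# and average frequency too, so outliers / glitches are easy to spot:
for r in sorted(map(with_avg_freq, runs), key=lambda r: r["cycles"]):
    print(f"{r['name']}: {r['cycles']:>13,} cycles, "
          f"{r['task_clock_sec']:.2f} s, {r['avg_ghz']:.2f} GHz avg")
```

Here run B ranks first on raw cycles, but its 1.50GHz average frequency stands out against run A's 4.02GHz as a likely throttled or contended run.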
