Why do I see 400x outlier timings when calling clock_gettime repeatedly?

Question

I'm trying to measure the execution time of some commands in C++ by using the physical clock, but I have run into a problem: the process of reading off the measurement from the physical clock on the computer can take a long time. Here is the code:

#include <string>
#include <cstdlib>
#include <cstdint>   // for int64_t
#include <iostream>
#include <math.h>
#include <time.h>

int main()
{
    int64_t mtime, mtime2, m_TSsum, m_TSssum, m_TSnum, m_TSmax;
    struct timespec t0;
    struct timespec t1;
    int i, j;
    for (j = 0; j < 10; j++) {
        m_TSnum = 0; m_TSsum = 0; m_TSssum = 0; m_TSmax = 0;
        for (i = 0; i < 10000000; i++) {
            // Two back-to-back clock readings; their difference is the
            // cost of one clock_gettime call.
            clock_gettime(CLOCK_REALTIME, &t0);
            clock_gettime(CLOCK_REALTIME, &t1);
            mtime  = (t0.tv_sec * 1000000000LL + t0.tv_nsec);
            mtime2 = (t1.tv_sec * 1000000000LL + t1.tv_nsec);

            // Accumulate sum, sum of squares, and max for the statistics.
            m_TSsum  += (mtime2 - mtime);
            m_TSssum += (mtime2 - mtime) * (mtime2 - mtime);
            if ((mtime2 - mtime) > m_TSmax) { m_TSmax = (mtime2 - mtime); }
            m_TSnum++;
        }
        // Report mean +/- standard deviation (computed with integer
        // division, so slightly truncated) and the maximum delta.
        std::cout << "Average " << (double)(m_TSsum) / m_TSnum
              << " +/- " << floor(sqrt((m_TSssum / m_TSnum - (m_TSsum / m_TSnum) * (m_TSsum / m_TSnum))))
              << " (" << m_TSmax << ")" << std::endl;
    }
}

Next I run it on a dedicated core (or so the sysadmin tells me), to avoid any issues with the process being moved to the background by the scheduler:

$ taskset -c 20 ./a.out

Here are the results I get:

Average 18.0864 +/- 10 (17821)
Average 18.0807 +/- 8 (9116)
Average 18.0802 +/- 8 (8107)
Average 18.078 +/- 6 (7135)
Average 18.0834 +/- 9 (21240)
Average 18.0827 +/- 8 (7900)
Average 18.0822 +/- 8 (9079)
Average 18.086 +/- 8 (8840)
Average 18.0771 +/- 6 (5992)
Average 18.0894 +/- 10 (15625)

So clearly it takes about 18 nanoseconds (on this particular server) to call clock_gettime(), but what I can't understand is why the "max" time seems to be between 300 and 1000 times longer?

If we assume that the core is truly dedicated to this process and not used by something else (which may or may not be true; when not running on a dedicated core, the average time is the same, but the sd/max are somewhat bigger), what else could cause these "slowdowns" (for lack of a better name)?

Answer

Why Outliers?

There are many software- and hardware-related reasons why you might see outlier events (and non-outlier variation) when you iterate 10 million times over two clock_gettime calls. These reasons include:

  • Context switches: the scheduler may decide to migrate your process between CPUs, and even if you pin your process to a CPU, the OS may periodically decide to run something else on your logical CPU.
  • SMT: assuming this is on a CPU with SMT (e.g., hyperthreading on x86), the scheduler will probably periodically schedule something on the sibling logical core (the same physical core as your process). This can dramatically affect the overall performance of your code since two threads are competing for the same core resources. Furthermore, there is probably a transition period between SMT and non-SMT execution where nothing executes, since the core has to re-partition some resources when SMT execution begins.
  • Interrupts: A typical system will receive hundreds of interrupts per second at a minimum, from the network card, graphics devices, hardware clocks, system timers, audio devices, IO devices, cross-CPU IPIs, and so on. Try watch -n1 cat /proc/interrupts and see how much activity is occurring on what you might think is an otherwise idle system.
  • Hardware pauses: the CPU itself may periodically stop executing instructions for a variety of reasons such as power or thermal throttling, or just because the CPU is undergoing a frequency transition.
  • System Management Mode: totally apart from interrupts seen and handled by the OS, x86 CPUs have a type of "hidden interrupt" which allows SMM functionality to execute on your CPU, with the only apparent effect being periodic unexpected jumps in the cycle counters used to measure real time.
  • Normal performance variations: your code won't execute in exactly the same way every time. Initial iterations will suffer data and instruction cache misses, and have untrained predictors for things like branch direction. Even in an apparent "steady state" you may still suffer performance variations from things beyond your control.
  • Different code paths: you might expect your loop to execute exactly the same instructions every time through[1]: after all, nothing is really changing, right? Well, if you dig into the internals of clock_gettime you may very well find that some branches take a different path when some kind of overflow occurs, or when a read of the adjustment factors in the VDSO races with an update, etc.

That's not even a comprehensive list, but it should at least give you a taste of some of the factors that can cause outliers. You can eliminate or reduce the effect of some of these, but complete control is generally impossible on a modern non-realtime[2] OS on x86.

If I had to take a guess, based on a typical outlier of ~8000 ns, which is probably too small to be a context-switch interruption, you are probably seeing the effect of processor frequency scaling due to variable TurboBoost ratios. That's a mouthful, but basically modern x86 chips run at different "max turbo" speeds depending on how many cores are active. My i7-6700HQ, for example, will run at 3.5 GHz if one core is active, but only 3.3, 3.2 or 3.1 GHz if 2, 3 or 4 cores are active, respectively.

This means that even if your process is never interrupted, any work at all which runs even briefly on another CPU may cause a frequency transition (e.g., because you transition from 1 to 2 active cores), and during such a transition the CPU is idled for thousands of cycles while voltages stabilize. You can find some detailed numbers and tests in this answer, but the upshot is that on the tested CPU the stabilization takes roughly 20,000 cycles; at a clock in the 2-3 GHz range that works out to roughly 7,000-10,000 ns, very much in line with your observed outliers of ~8000 nanoseconds. Sometimes you might get two transitions in a period, which doubles the impact, and so on.

If you still want to know the cause of your outliers, you can take the following steps and observe the effect on the outlier behavior.

First, you should collect more data. Rather than just recording the max over 10,000,000 iterations, you should collect a histogram with some reasonable bucket size (say 100 ns, or even better some type of geometric bucket size that gives higher resolution for shorter times). This will be a huge help because you'll be able to see exactly where the times are clustering: it is entirely possible that you have effects other than the 6000 - 17000 ns outliers that you note with "max", and they can have different causes.

A histogram also lets you understand the outlier frequency, which you can correlate with frequencies of things you can measure to see if they match up.

Now, adding the histogram code also potentially adds more variance to the timing loop, since (for example) you'll be accessing different cache lines depending on the timing value, but this is manageable, especially because the recording of the time happens outside the "timed region".
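
As a rough illustration, here is a minimal sketch of such a geometric histogram (illustrative code, not the poster's: the power-of-two bucketing, the 40-bucket cap, and the now_ns helper are arbitrary choices):

#include <cstdint>
#include <iostream>
#include <time.h>
#include <vector>

// Read CLOCK_REALTIME as a single nanosecond count.
static inline int64_t now_ns() {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main() {
    // Power-of-two buckets: bucket k counts deltas in [2^k, 2^(k+1)) ns,
    // giving fine resolution at the fast end and coarse at the slow end.
    // (Deltas of 0 or 1 ns land in bucket 0.)
    std::vector<int64_t> hist(40, 0);
    for (int i = 0; i < 10000000; i++) {
        int64_t t0 = now_ns();
        int64_t t1 = now_ns();   // the timed region ends here
        int64_t d  = t1 - t0;    // assumed non-negative
        // Bucketing happens outside the timed region, so at worst it
        // perturbs the *next* sample, not the one just taken.
        int k = 0;
        while ((d >> k) > 1 && k < 39) k++;
        hist[k]++;
    }
    for (int k = 0; k < 40; k++)
        if (hist[k])
            std::cout << "[" << (1LL << k) << ", " << (1LL << (k + 1))
                      << ") ns: " << hist[k] << "\n";
}

Geometric buckets keep ~1 ns resolution for the common 18 ns case while still capturing multi-microsecond outliers in a handful of buckets.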

With that in hand, you can try to systematically check the issues I mentioned above to see if they are the cause. Here are some ideas:

  1. Hyperthreading: Just turn it off in the BIOS while running single-threaded benchmarks, which eliminates that whole class of issues in one move. In general, I've found that this also leads to a giant reduction in fine-grained benchmark variance, so it's a good first step.
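
If a BIOS visit is inconvenient, recent Linux kernels (4.19+) also expose a runtime SMT switch; a minimal sketch, assuming your kernel provides it:

$ cat /sys/devices/system/cpu/smt/control                   # reports "on", "off", etc.
$ echo off | sudo tee /sys/devices/system/cpu/smt/control   # disable SMT until reboot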

  2. Frequency scaling: On Linux, you can usually disable sub-nominal frequency scaling by setting the cpufreq governor to "performance". You can disable super-nominal (aka turbo) scaling by setting /sys/devices/system/cpu/intel_pstate/no_turbo to 1 if you're using the intel_pstate driver. You can also manipulate the turbo mode directly via MSR if you have another driver, or you can do it in the BIOS if all else fails. In the linked question the outliers basically disappear when turbo is disabled, so that's something to try first.
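
Concretely, that might look something like this (a sketch assuming the cpupower utility is installed and the intel_pstate driver is in use; adjust for your distribution):

$ sudo cpupower frequency-set -g performance                          # pin the governor
$ echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo     # 1 disables turbo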

Assuming you actually want to keep using turbo in production, you can limit the max turbo ratio manually to some value that applies to N cores (e.g., 2 cores), and then offline the other CPUs so that at most that number of cores will ever be active. Then you'll be able to run at your new max turbo all the time no matter how many cores are active (of course, you might still be subject to power, current or thermal limits in some cases).
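
Offlining cores can be done through sysfs; a sketch, where cpu21 is just an arbitrary example core:

$ echo 0 | sudo tee /sys/devices/system/cpu/cpu21/online   # take cpu21 offline
$ echo 1 | sudo tee /sys/devices/system/cpu/cpu21/online   # bring it back later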

  3. Interrupts: you can search for "interrupt affinity" to try to move interrupts to/from your pinned core and see the effect on the outlier distribution. You can also count the number of interrupts (e.g., via /proc/interrupts) and see if the count is enough to explain the outlier count. If you find that timer interrupts specifically are the cause, you can explore the various "tickless" (aka "NOHZ") modes your kernel offers to reduce or eliminate them. You can also count them directly via the HW_INTERRUPTS.RECEIVED performance counter on x86.
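
For example (a sketch: the IRQ number 123 is hypothetical, smp_affinity takes a hex CPU bitmask, and the exact perf event name varies by CPU model and perf version):

$ cat /proc/interrupts                                      # per-CPU interrupt counts
$ echo 2 | sudo tee /proc/irq/123/smp_affinity              # steer IRQ 123 to CPU 1 (mask 0x2)
$ perf stat -e hw_interrupts.received -C 20 -- sleep 10     # count interrupts hitting core 20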

  4. Context switches: you can use realtime priorities or isolcpus to prevent other processes from running on your CPU. Keep in mind that context-switch issues, while usually positioned as the main/only issue, are actually fairly rare: at most they generally happen at the HZ rate (often 250/second on modern kernels) - but on a mostly idle system it will be rare for the scheduler to actually decide to schedule another process on your busy CPU. If you make your benchmark loops short, you can generally almost entirely avoid context switches.
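
A sketch of the realtime-priority approach (core 20 matches the taskset example above; isolcpus is a boot-time kernel parameter, not a runtime command):

$ sudo chrt -f 99 taskset -c 20 ./a.out    # run under SCHED_FIFO at the highest priority

and/or boot with isolcpus=20 on the kernel command line so the scheduler never places other tasks on that core.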

  5. Code-related performance variations: you can check whether this is happening with various profiling tools like perf. You can carefully design the core of your packet-handling code to avoid outlier events like cache misses, e.g., by pre-touching cache lines, and you can avoid the use of system calls with unknown complexity as much as possible.
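
For instance (a sketch; exact event availability depends on your CPU and perf build):

$ perf stat -e cache-misses,branch-misses,context-switches taskset -c 20 ./a.out
$ perf record taskset -c 20 ./a.out && perf report    # locate hot or variable spots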

While some of the above are purely for investigative purposes, many of them will both help you determine what's causing the pauses and also mitigate them.

I'm not aware of mitigations for all of these issues, however - for something like SMM you'd perhaps need specialized hardware or a special BIOS to avoid it.

[1] Well, except perhaps in the case where the if( (mtime2-mtime) > m_TSmax ) condition is triggered - but this should be rare (and perhaps your compiler has made it branch-free, in which case there is only one execution path).

[2] It's not actually clear you can get to "zero variance" even with a hard realtime OS: some x86-specific factors like SMM mode and DVFS-related stalls seem unavoidable.
