Why do I see 400x outlier timings when calling clock_gettime repeatedly?

Question

I'm trying to measure the execution time of some commands in C++ by using the physical clock, but I have run into a problem: the process of reading off the measurement from the physical clock on the computer can take a long time. Here is the code:

#include <string>
#include <cstdlib>
#include <cstdint>
#include <iostream>
#include <math.h>
#include <time.h>

int main()
{
      int64_t mtime, mtime2, m_TSsum, m_TSssum, m_TSnum, m_TSmax;
      struct timespec t0;
      struct timespec t1;
      int i,j;
      for(j=0;j<10;j++){
            m_TSnum=0;m_TSsum=0; m_TSssum=0; m_TSmax=0;
            for( i=0; i<10000000; i++) {
                  clock_gettime(CLOCK_REALTIME,&t0);
                  clock_gettime(CLOCK_REALTIME,&t1);
                  mtime = (t0.tv_sec * 1000000000LL + t0.tv_nsec);
                  mtime2= (t1.tv_sec * 1000000000LL + t1.tv_nsec);

                  m_TSsum += (mtime2-mtime);
                  m_TSssum += (mtime2-mtime)*(mtime2-mtime);
                  if( (mtime2-mtime)> m_TSmax ) { m_TSmax = (mtime2-mtime);}
                  m_TSnum++;
            }
            std::cout << "Average "<< (double)(m_TSsum)/m_TSnum
                  << " +/- " << floor(sqrt( (m_TSssum/m_TSnum  - ( m_TSsum/m_TSnum ) *( m_TSsum/m_TSnum ) ) ) )
                  << " ("<< m_TSmax <<")" <<std::endl;
      }
}

Next I run it on a dedicated core (or so the sysadmin tells me), to avoid any issues with the process being moved to the background by the scheduler:

$ taskset -c 20 ./a.out

Here are the results I get:

Average 18.0864 +/- 10 (17821)
Average 18.0807 +/- 8 (9116)
Average 18.0802 +/- 8 (8107)
Average 18.078 +/- 6 (7135)
Average 18.0834 +/- 9 (21240)
Average 18.0827 +/- 8 (7900)
Average 18.0822 +/- 8 (9079)
Average 18.086 +/- 8 (8840)
Average 18.0771 +/- 6 (5992)
Average 18.0894 +/- 10 (15625)

So it clearly takes about 18 nanoseconds (on this particular server) to call clock_gettime(), but what I can't understand is why the "max" time seems to be between 300 and 1000 times longer.

If we assume that the core is truly dedicated to this process and not used by anything else (which may or may not be true; when not running on a dedicated core, the average time is the same, but the sd/max are somewhat bigger), what else could cause these "slowdowns" (for lack of a better name)?

Answer

Why Outliers?

There are many software and hardware related reasons why you might see outlier events (and non-outlier variation) when you iterate 10 million times over two clock_gettime calls. These reasons include:

  • Context switches: the scheduler may decide to migrate your process between CPUs, and even if you pin your process to a CPU, the OS may periodically decide to run something else on your logical CPU.
  • SMT: assuming this is on a CPU with SMT (e.g., hyperthreading on x86), the scheduler will probably periodically schedule something on the sibling core (same physical core as your process). This can dramatically affect the overall performance of your code since two threads are competing for the same core resources. Furthermore, there is probably a transition period between SMT and non-SMT execution where nothing executes, since the core has to re-partition some resources when SMT execution begins.
  • Interrupts: A typical system will receive hundreds of interrupts per second at a minimum, from the network card, graphics devices, hardware clocks, system timers, audio devices, IO devices, cross-CPU IPIs, and so on. Try watch -n1 cat /proc/interrupts and see how much action is occurring on what you might think is an otherwise idle system.
  • Hardware pauses: the CPU itself may periodically stop executing instructions for a variety of reasons such as power or thermal throttling, or just because the CPU is undergoing a frequency transition.
  • System Management Mode: totally apart from interrupts seen and handled by the OS, x86 CPUs have a type of "hidden interrupt" which allows SMM functionality to execute on your CPU, with the only apparent effect being periodic unexpected jumps in cycle counters used to measure real time.
  • Normal performance variations: your code won't execute in exactly the same way every time. Initial iterations will suffer data and instruction cache misses, and have untrained predictors for things like branch direction. Even in an apparent "steady state" you may still suffer performance variations from things beyond your control.
  • Different code paths: you might expect your loop to execute exactly the same instructions every time through [1]: after all, nothing is really changing, right? Well, if you dig into the internals of clock_gettime you may very well find branches that take a different path when some type of overflow occurs, or when a read of the adjustment factors in the VDSO races with an update, etc.

That's not even a comprehensive list, but it should at least give you a taste of some of the factors that can cause outliers. You can eliminate or reduce the effect of some of these, but complete control is generally impossible on a modern non-realtime [2] OS on x86.

If I had to take a guess, based on a typical outlier of ~8000 ns, which is probably too small for a context switch interruption, you are probably seeing the effect of processor frequency scaling due to variable TurboBoost ratios. That's a mouthful, but basically modern x86 chips run at different "max turbo" speeds depending on how many cores are active. My i7-6700HQ, for example, will run at 3.5 GHz if one core is active, but only 3.3, 3.2 or 3.1 GHz if 2, 3 or 4 cores are active, respectively.

This means that even if your process is never interrupted, any work at all which runs even briefly on another CPU may cause a frequency transition (e.g., because you transition from 1 to 2 active cores), and during such a transition the CPU is idled for thousands of cycles while voltages stabilize. You can find some detailed numbers and tests in this answer, but the upshot is that on the tested CPU the stabilization takes roughly 20,000 cycles (around 8,000 ns at a ~2.5 GHz clock), very much in line with your observed outliers of ~8000 nanoseconds. Sometimes you might get two transitions in a period, which doubles the impact, and so on.

If you still want to know the cause of your outliers, you can take the following steps and observe the effect on the outlier behavior.

First, you should collect more data. Rather than just recording the max over 10,000,000 iterations, you should collect a histogram with some reasonable bucket size (say 100 ns, or even better some type of geometric bucket size that gives higher resolution for shorter times). This will be a huge help because you'll be able to see exactly where the times are clustering: it is entirely possible that you have effects other than the 6,000-17,000 ns outliers that you note with "max", and they can have different causes.

A histogram also lets you understand the outlier frequency, which you can correlate with frequencies of things you can measure to see if they match up.

Now adding the histogram code also potentially adds more variance to the timing loop, since (for example) you'll be accessing different cache lines depending on the timing value, but this is manageable, especially because the recording of the time happens outside the "timed region".
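
As a concrete illustration (not part of the original answer), a minimal sketch of such a histogram loop might look like the following, using power-of-two (geometric) buckets so that short times get fine resolution while rare outliers still land in a bucket. The bucketing scheme, bucket count, and names like hist are illustrative choices only:

#include <cstdint>
#include <cstdio>
#include <time.h>

int main()
{
      // 40 power-of-two buckets: bucket k counts deltas of roughly [2^k, 2^(k+1)) ns.
      // Negative deltas (e.g., if the realtime clock is stepped back) fall into bucket 0.
      uint64_t hist[40] = {0};
      struct timespec t0, t1;

      for (int i = 0; i < 10000000; i++) {
            clock_gettime(CLOCK_REALTIME, &t0);
            clock_gettime(CLOCK_REALTIME, &t1);   // the timed region ends here

            // Recording happens outside the timed region, so the cache lines it
            // touches do not sit between the two clock_gettime calls being measured.
            int64_t delta = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                          + (t1.tv_nsec - t0.tv_nsec);
            int bucket = 0;
            while (delta > 1 && bucket < 39) { delta >>= 1; bucket++; }
            hist[bucket]++;
      }

      for (int k = 0; k < 40; k++) {
            if (hist[k]) {
                  long long lo = (k == 0) ? 0 : (1LL << k);
                  printf("[%lld, %lld) ns: %llu\n", lo, 1LL << (k + 1),
                         (unsigned long long)hist[k]);
            }
      }
}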

With that in hand, you can try to systematically check the issues I mentioned above to see if they are the cause. Here are some ideas:

  1. Hyperthreading: Just turn it off in the BIOS while running single-threaded benchmarks, which eliminates that whole class of issues in one move. In general, I've found that this also leads to a giant reduction in fine-grained benchmark variance, so it's a good first step.
  2. Frequency scaling: On Linux, you can usually disable sub-nominal frequency scaling by setting the cpufreq governor to "performance". You can disable super-nominal (aka turbo) scaling by setting /sys/devices/system/cpu/intel_pstate/no_turbo to 1 if you're using the intel_pstate driver. You can also manipulate the turbo mode directly via MSR if you have another driver, or you can do it in the BIOS if all else fails. In the linked question the outliers basically disappear when turbo is disabled, so that's something to try first (see the sketch after this list).
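
For illustration only, here is a rough sketch of how those two settings might be applied from a small C++ helper. It assumes the intel_pstate driver, root privileges, and the standard cpufreq sysfs layout; the choice of cpu20 simply mirrors the core pinned with taskset above and is not required:

#include <fstream>
#include <iostream>
#include <string>

// Write a single value to a sysfs attribute, reporting failure on stderr.
static bool write_sysfs(const std::string& path, const std::string& value)
{
      std::ofstream f(path);
      if (!f) { std::cerr << "cannot open " << path << "\n"; return false; }
      f << value << "\n";
      return static_cast<bool>(f);
}

int main()
{
      // no_turbo = 1 tells intel_pstate not to use turbo (super-nominal) frequencies.
      write_sysfs("/sys/devices/system/cpu/intel_pstate/no_turbo", "1");
      // Keep the benchmark core at its nominal frequency instead of scaling down.
      write_sysfs("/sys/devices/system/cpu/cpu20/cpufreq/scaling_governor", "performance");
}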

Assuming you actually want to keep using turbo in production, you can limit the max turbo ratio manually to some value that applies to N cores (e.g., 2 cores), and then offline the other CPUs so at most that number of cores will ever be active. Then you'll be able to run at your new max turbo all the time no matter how many cores are active (of course, you might still be subject to power, current or thermal limits in some cases).
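
The offlining part can use the same sysfs-writing pattern as above. A hedged sketch, assuming root privileges and the standard Linux CPU hotplug interface; the CPU numbers are illustrative (and cpu0 normally cannot be taken offline):

#include <fstream>
#include <iostream>
#include <string>

// Bring a CPU online (true) or take it offline (false) via the hotplug interface.
static void set_cpu_online(int cpu, bool online)
{
      std::string path = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/online";
      std::ofstream f(path);
      if (!f) { std::cerr << "cannot open " << path << "\n"; return; }
      f << (online ? "1" : "0") << "\n";
}

int main()
{
      // Keep cpu0 and cpu20 (the benchmark core from the question) online and take
      // the rest offline; adjust the range to match your machine's CPU count.
      for (int cpu = 1; cpu < 40; cpu++) {
            if (cpu != 20) set_cpu_online(cpu, false);
      }
}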

While some of the above are purely for investigative purposes, many of them will both help you determine what's causing the pauses and also mitigate them.

I'm not aware of mitigations for every issue, however: for something like SMM you'd perhaps need specialized hardware or BIOS support to avoid it.

[1] Well, except perhaps in the case that the if( (mtime2-mtime)> m_TSmax ) condition is triggered - but this should be rare (and perhaps your compiler has made it branch-free, in which case there is only one execution path).

[2] It's not actually clear you can get to "zero variance" even with a hard realtime OS: some x86-specific factors like SMM mode and DVFS-related stalls seem unavoidable.
