Why does this delay-loop start to run faster after several iterations with no sleep?

Problem description

Consider:

#include <time.h>
#include <unistd.h>
#include <iostream>
using namespace std;

const int times = 1000;
const int N = 100000;

void run() {
  for (int j = 0; j < N; j++) {
  }
}

int main() {
  clock_t main_start = clock();
  for (int i = 0; i < times; i++) {
    clock_t start = clock();
    run();
    cout << "cost: " << (clock() - start) / 1000.0 << " ms." << endl;
    //usleep(1000);
  }
  cout << "total cost: " << (clock() - main_start) / 1000.0 << " ms." << endl;
}

Here is the example code. In the first 26 iterations of the timing loop, the run function costs about 0.4 ms, but then the cost drops to 0.2 ms.

When the usleep is uncommented, the delay-loop takes 0.4 ms for all runs, never speeding up. Why?

The code is compiled with g++ -O0 (no optimization), so the delay loop isn't optimized away. It runs on an Intel(R) Core(TM) i3-3220 CPU @ 3.30 GHz, with the 3.13.0-32-generic kernel on Ubuntu 14.04.1 LTS (Trusty Tahr).

Solution

After 26 iterations, Linux ramps the CPU up to the maximum clock speed since your process uses its full time slice a couple of times in a row.
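You can watch this ramp directly. Below is a minimal sketch, assuming the standard Linux cpufreq sysfs interface (the path is present on most distributions; adjust cpu0 for other cores):

#include <fstream>

// Read the current frequency of cpu0, in kHz, as reported by the kernel.
long current_khz() {
  std::ifstream f("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
  long khz = 0;
  f >> khz;
  return khz;
}

Calling current_khz() before each timed run() should show the reported frequency jumping from the minimum up to ~3.3 GHz right around the iteration where the per-run cost drops from 0.4 ms to 0.2 ms.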

If you checked with performance counters instead of wall-clock time, you'd see that the core clock cycles per delay-loop stayed constant, confirming that it's just an effect of DVFS (which all modern CPUs use to run at a more energy-efficient frequency and voltage most of the time).
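On Linux, the quickest check is running the program under perf stat and comparing cycle counts. In-process, a minimal sketch using the perf_event_open(2) syscall could look like the following (no error handling; the helper name cycles_for is made up for illustration):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Count core clock cycles spent in fn(); unlike clock(), this count is
// unaffected by the CPU frequency changing under our feet.
uint64_t cycles_for(void (*fn)()) {
  perf_event_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_CPU_CYCLES;
  attr.disabled = 1;
  attr.exclude_kernel = 1;
  int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  fn();
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  uint64_t cycles = 0;
  read(fd, &cycles, sizeof(cycles));
  close(fd);
  return cycles;
}

cycles_for(run) should report roughly the same count (~600k cycles; see the arithmetic below) for both the 0.4 ms and the 0.2 ms iterations.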

If you tested on a Skylake with kernel support for the new power-management mode (where the hardware takes full control of the clock speed), ramp-up would happen much faster.

If you leave it running for a while on an Intel CPU with Turbo, you'll probably see the time per iteration increase again slightly once thermal limits require the clock speed to reduce back down to the maximum sustained frequency.


Introducing a usleep prevents Linux's CPU frequency governor from ramping up the clock speed, because the process isn't generating 100% load even at minimum frequency. (I.e. the kernel's heuristic decides that the CPU is running fast enough for the workload that's running on it.)
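A quick way to test this explanation is to pin the frequency governor to "performance" and re-run: with the clock locked at maximum, the usleep version should also take ~0.2 ms per run(). A sketch, assuming the cpufreq sysfs interface and root privileges (repeat for each cpuN, or use a tool like cpupower):

#include <fstream>

int main() {
  // The "performance" governor always requests the highest available frequency.
  std::ofstream g("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
  g << "performance";
}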



Comments on other theories:

re: David's theory that a potential context switch from usleep could pollute caches: That's not a bad idea in general, but it doesn't help explain this code.

Cache / TLB pollution isn't important at all for this experiment. There's basically nothing inside the timing window that touches memory other than the end of the stack. Most of the time is spent in a tiny loop (1 line of instruction cache) that only touches one int of stack memory. Any potential cache pollution during usleep is a tiny fraction of the time for this code (real code will be different)!

In more detail for x86:

The call to clock() itself might cache-miss, but a code-fetch cache miss delays the starting-time measurement, rather than being part of what's measured. The second call to clock() will almost never be delayed, because it should still be hot in cache.

The run function may be in a different cache line from main (since gcc marks main as "cold", so it gets optimized less and placed with other cold functions/data). We can expect one or two instruction-cache misses. They're probably still in the same 4k page, though, so main will have triggered the potential TLB miss before entering the timed region of the program.

gcc -O0 compiles the OP's code to something like the following (see the Godbolt Compiler Explorer), keeping the loop counter in memory on the stack.
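A rough reconstruction of that -O0 output (exact labels, frame setup, and instruction selection vary by gcc version):

run:
  push    rbp
  mov     rbp, rsp
  mov     DWORD PTR [rbp-4], 0      # j lives on the stack at -O0
  jmp     .L2
.L3:
  add     DWORD PTR [rbp-4], 1      # memory-destination read-modify-write
.L2:
  cmp     DWORD PTR [rbp-4], 99999  # j < 100000
  jle     .L3
  pop     rbp
  ret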

The empty loop keeps the loop counter in stack memory, so the loop runs at about one iteration per ~6 cycles on the OP's IvyBridge CPU, thanks to the store-forwarding latency that's part of an add with a memory destination (read-modify-write). 100k iterations * 6 cycles/iteration is 600k cycles, which dominates the contribution of at most a couple of cache misses (~200 cycles each for code-fetch misses, which prevent further instructions from issuing until they're resolved).
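As a sanity check on those numbers, here's the back-of-the-envelope arithmetic (3.3 GHz is from the question; ~1.6 GHz as the idle frequency is an assumption, though typical for a desktop Ivy Bridge chip):

#include <cstdio>

int main() {
  const double cycles = 100000.0 * 6.0;  // ~600k core cycles per run()
  std::printf("at 3.3 GHz: %.2f ms\n", cycles / 3.3e9 * 1e3);  // ~0.18 ms: the fast runs
  std::printf("at 1.6 GHz: %.2f ms\n", cycles / 1.6e9 * 1e3);  // ~0.38 ms: the slow runs
}

Both estimates line up with the observed 0.2 ms and 0.4 ms timings, which is consistent with a constant cycle count and a roughly 2x frequency ramp.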

Out-of-order execution and store-forwarding should mostly hide the potential cache miss on accessing the stack (as part of the call instruction).

Even if the loop counter were kept in a register, 100k cycles would still be a lot.
